FLYNN’S CLASSIFICATION, SYSTEM ATTRIBUTES TO
PERFORMANCE, PARALLEL COMPUTER MODELS


Q.1. Describe Flynn’s classification based on the multiplicity of
instruction streams and data streams in a computer system with neat
diagrams. (R.G.P.V., Dec. 2003, 2004)

Or

Explain Flynn’s classification based on multiplicity of instruction
streams and data streams. (R.G.P.V., Dec. 2007, June 2011, 2016)

Or

Explain different computer models briefly. (R.G.P.V., June 2012)

Or

Discuss Flynn’s classification. (R.G.P.V., Dec. 2016)

Or

Explain Flynn’s classification of computer with the help of neat
diagrams. (R.G.P.V., Dec. 2017)


Ans. The normal operation of a computer is to fetch instructions from
memory and execute them in the processor. The sequence of instructions read
from memory, as executed by the machine, constitutes an instruction stream.

The operations performed on the data in the processor constitute a data
stream. A data stream is a sequence of data, including input, partial or
temporary results, called for by the instruction stream.


According to the multiplicity of instruction and data streams, digital 
computers may be classified into four categories. Michael J. Flynn introduced 
this scheme for classifying computer organizations. Flynn’s four machine 
organizations are listed below — 

(i) Single instruction stream-single data stream (SISD) 

(ii) Single instruction stream-multiple data stream (SIMD) 
(iii) Multiple instruction stream-single data stream (MISD) 
(iv) Multiple instruction stream-multiple data stream (MIMD). 
(i) SISD Computer Organization — SISD is the classical, sequential
Von Neumann computer. It has one instruction stream and one data stream,
and does one thing at a time. In other words, SISD computers manipulate a
single data item by executing one instruction at a time. Fig. 1.1 (a) shows
SISD computer organization. This organization represents most serial
computers available today. Instructions are run sequentially but may be
overlapped in their execution stages; SISD uniprocessors are generally
pipelined. There may be more than one functional unit in an SISD computer,
all supervised by a single control unit. The SISD classification covers
conventional uniprocessor systems such as the VAX-11 and IBM 370 computers.
The CDC 6600 and IBM 370/168 computers are typical examples of SISD systems
with multiple functional units.

(ii) SIMD Computer Organization — SIMD machines have a single 
control unit that issues one instruction at a time, but they have multiple ALUs 
to carry it out on multiple data sets simultaneously. The SIMD systems allow 
a single instruction to manipulate several data elements. These machines are 
called vector machines or array processors. As illustrated in fig. 1.1 (b), there 
are multiple processing elements supervised by the same control unit. All PEs 
receive the same instruction broadcast from the control unit but operate on 
different data sets from distinct data streams. The shared memory subsystem 
may contain multiple modules. 

Examples of this type of computers are the ILLIAC-IV and Burroughs 
Scientific Processor (BSP). The ILLIAC-IV was an experimental parallel 
computer proposed by the University of Illinois and built by the Burroughs 
Corporation. This configuration is very useful for carrying out a high volume 
of computations that are encountered in application areas such as finite-element 
analysis, logic simulation and spectral analysis. 
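
The difference between issuing one instruction per data item and broadcasting one instruction to many processing elements can be sketched in a few lines of Python. This is only an illustrative model, not a description of ILLIAC-IV or BSP hardware; the function names `simd_broadcast` and `sisd_loop` are hypothetical and chosen for this sketch.

```python
from operator import add, mul

def simd_broadcast(instruction, data_a, data_b):
    """Model of SIMD: one broadcast instruction, applied by every PE to its own operands."""
    # Each list position stands for one processing element (PE).
    # Real SIMD hardware performs these operations in lockstep; Python only models the result.
    return [instruction(a, b) for a, b in zip(data_a, data_b)]

def sisd_loop(instruction, data_a, data_b):
    """Model of SISD: the same work done one data item per instruction issue."""
    results = []
    for i in range(len(data_a)):
        results.append(instruction(data_a[i], data_b[i]))
    return results

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
sums = simd_broadcast(add, a, b)       # one "add" instruction, four data streams
products = simd_broadcast(mul, a, b)   # one "multiply" instruction, four data streams
print(sums, products)
```

Both functions return the same values; the point of the sketch is that the SIMD version needs a single instruction issue for all four data elements.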


(iii) MISD Computer Organization — MISD is a computer in which 
several instructions manipulate the same data stream concurrently. This concept 
is only of theoretical interest, as no system has been built using this idea. 
However, the notion of pipelining is very close to the MISD definition. 

Fig. 1.1 (c) shows the MISD computer organization. There are n
processor units. Each unit receives a different instruction operating over
the same data stream and its derivatives. The output (results) of one
processor becomes the input of the next processor in the macropipe. This
structure has attracted much less attention and has been challenged as
impractical by some computer architects. There is no real embodiment of
this class.


(iv) MIMD Computer Organization — MIMD organization refers to
a computer that is capable of processing several programs simultaneously.
MIMD machines are multiple independent CPUs operating as part of a larger
system. Most parallel processing systems fall into this category. Both
multiprocessors and multicomputers are MIMD machines. An intrinsic MIMD
computer implies interaction among the n processors, as all memory streams
are derived from the same data space shared by all processors. If the n data
streams were derived from disjoint subspaces of the shared memories, then we
would have the so-called multiple SISD (MSISD) operation, which is a set of
n independent SISD uniprocessor systems.
Fig. 1.1 Flynn’s Classification of Various Computer Organizations:
(a) SISD Computer, (b) SIMD Computer, (c) MISD Computer, (d) MIMD Computer


An intrinsic MIMD computer is tightly coupled if the degree of interaction
among processors is high. Otherwise, we consider it loosely coupled.
Most commercial MIMD computers are loosely coupled. With the invention
of low-cost microprocessors, the idea of building multiprocessor systems has
become practically feasible. Carnegie Mellon University’s Cm* and Cray
Research’s Cray-2 computers are examples of popular multiprocessor systems.

Q.2. Describe at least four characteristics of MIMD multiprocessors that
distinguish them from multiple computer systems. (R.G.P.V., Dec. 2014)
Ans. There are four characteristics of MIMD multiprocessors that distinguish
them from multicomputers, as follows —

(i) Multiprocessor: A multiprocessor system is simply a computer which has
more than one CPU on its motherboard.
Multicomputer: A multicomputer consists of several computers, similar to
parallel computing.

(ii) Multiprocessor: A multiprocessor would run slower, because it would be
in one computer.
Multicomputer: A multicomputer can run faster.

(iii) Multiprocessor: Multiprocessors have a single physical address space
(memory) shared by all the CPUs.
Multicomputer: Multicomputers have one physical address space per CPU.

(iv) Multiprocessor: A multiprocessor is a single system with multiple CPUs.
Multicomputer: A multicomputer is multiple computers, each of which can have
multiple processors. It is used for true parallel processing.



Q.3. What is the need of high performance computers? (R.G.P.V., 2016)

Ans. High performance computers are needed to provide solutions, and they
are increasingly in demand in the areas of structural analysis, artificial
intelligence, remote sensing, weather forecasting, expert systems, genetic
engineering, military defense and industrial automation, among many other
scientific and engineering applications. Without supercomputers, many of
these challenges to advance human civilization cannot be met within a
reasonable time period. Achieving high performance depends not only on using
faster and more reliable hardware devices but also on improvements in
computer architecture and processing techniques.

Q.4. Explain the following terms to measure performance of computer
system —
(i) Clock rate and CPI (cycles per instruction)
(ii) MIPS (million instructions per second) rate
(iii) Throughput rates
(iv) Performance factors.
(R.G.P.V., June 2004, 2011)
Ans. (i) Clock Rate and CPI (Cycles per Instruction) — A clock with a
constant cycle time (t in nanoseconds) is used to drive the CPU of today’s
computer. The clock rate (f = 1/t in megahertz) is the inverse of the cycle
time. Program size is given by the instruction count (Ic) of a program, in
terms of the number of machine instructions to be run in the program.
Different machine instructions may need different numbers of clock cycles
to execute. Hence, the cycles per instruction (CPI) is an important
parameter for finding out the time required to run each instruction.
An average CPI over all instruction types can be computed for a
given instruction set, provided we know their frequencies of appearance in
the program. A precise estimate of the average CPI needs a huge amount
of program code to be traced over a long period of time. Generally, the
term CPI refers to the average value with respect to a given instruction set
and a given program mix.
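
The averaging described above is a weighted sum of per-type cycle counts, weighted by how often each instruction type appears. A minimal sketch follows; the function name `average_cpi` and the two-type instruction mix are assumptions made purely for illustration.

```python
def average_cpi(mix):
    """Weighted average CPI.

    mix: list of (cycles_per_instruction, fraction_of_instructions) pairs,
    where the fractions are the frequencies of appearance in the program.
    """
    total_fraction = sum(fraction for _, fraction in mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(cycles * fraction for cycles, fraction in mix)

# Hypothetical mix: 75% one-cycle instructions and 25% four-cycle instructions.
example_mix = [(1, 0.75), (4, 0.25)]
cpi = average_cpi(example_mix)
print(cpi)  # 1.75
```

The check on the fractions guards against a trace whose instruction categories do not cover the whole program.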
(ii) MIPS (Million Instructions per Second) Rate — Suppose C is
the total number of clock cycles required to run a program. Then, the CPU
time is determined as T = C x t = C/f. Moreover, CPI = C/Ic and
T = Ic x CPI x t = Ic x CPI/f. Mostly, the CPU speed is measured in million
instructions per second (MIPS). This is called the MIPS rate of a given
processor. The MIPS rate changes with respect to a number of factors,
including the instruction count (Ic), the clock rate (f) and the CPI of a
given machine, as given below —

MIPS rate = Ic/(T x 10^6) = f/(CPI x 10^6) = (f x Ic)/(C x 10^6)

Here f is expressed in hertz; when f is given in MHz the factor 10^6 drops
out and MIPS rate = f/CPI. The CPU time can also be written as
T = Ic x 10^-6/MIPS. It means that the MIPS rate of a computer is directly
proportional to the clock rate and inversely proportional to the CPI. The
system attributes like the instruction set, compiler, processor and memory
technologies influence the MIPS rate, which also changes from program to
program.
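
The relations between CPU time, CPI and MIPS rate can be checked numerically. The sketch below assumes the clock rate is given in hertz; the helper names and the sample values (one million instructions, CPI of 2, 50 MHz clock) are hypothetical and not taken from any problem in this unit.

```python
def cpu_time(ic, cpi, f_hz):
    """T = Ic x CPI / f, in seconds (f in Hz)."""
    return ic * cpi / f_hz

def mips_from_time(ic, t_seconds):
    """MIPS rate = Ic / (T x 10^6)."""
    return ic / (t_seconds * 1e6)

def mips_from_cpi(f_hz, cpi):
    """MIPS rate = f / (CPI x 10^6)."""
    return f_hz / (cpi * 1e6)

# Assumed sample values, for illustration only.
ic, cpi, f_hz = 1_000_000, 2.0, 50e6
t = cpu_time(ic, cpi, f_hz)
print(t, mips_from_time(ic, t), mips_from_cpi(f_hz, cpi))
```

Both MIPS expressions give the same rate for the same program, which is the equality stated in the MIPS rate formula.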
(iii) Throughput Rate — System throughput (Ws, in programs/second)
is a measure of how many programs a system can execute per unit time. In a
multiprogramming system, the system throughput is mostly smaller than the
CPU throughput Wp given by —

Wp = f/(Ic x CPI)

Here, Wp = (MIPS) x 10^6/Ic. Wp is measured in programs/second. The CPU
throughput is a measure of how many programs can be executed per second,
depending on the MIPS rate and average program length (Ic). Ws is lower
than Wp because of the additional system overheads caused by the I/O,
compiler and OS when multiple programs are interleaved for CPU execution by
multiprogramming or time-sharing operations. Ws = Wp if the processor is
kept busy in a perfect program-interleaving manner. This will probably never
take place, as the system overhead often causes an extra delay
and the processor may remain idle for some cycles.
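
As a small numerical illustration of the CPU throughput formula, the sketch below computes Wp in two equivalent ways. The function names and the sample values are assumptions for this example; the clock rate is taken in hertz.

```python
def cpu_throughput(f_hz, ic, cpi):
    """Wp = f / (Ic x CPI), in programs per second (f in Hz)."""
    return f_hz / (ic * cpi)

def cpu_throughput_from_mips(mips, ic):
    """Wp = MIPS x 10^6 / Ic, in programs per second."""
    return mips * 1e6 / ic

# Assumed values: 50 MHz clock, programs of one million instructions, CPI of 2.
wp = cpu_throughput(50e6, 1_000_000, 2.0)
wp_alt = cpu_throughput_from_mips(25.0, 1_000_000)
print(wp, wp_alt)  # both 25.0 programs/second
```

The system throughput Ws of a real multiprogrammed machine would be at most this value, since I/O, compiler and OS overheads are not modelled here.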

(iv) Performance Factors — Suppose Ic is the number of instructions
in a given program, or the instruction count. We can estimate the CPU time
(T in seconds/program) required to execute the program by finding the
product of three contributing factors —
T = Ic x CPI x t
The execution of an instruction needs passing through a cycle of events 
including the instruction fetch, decode, operand (s) fetch, execution and store 
results. In this cycle, only the instruction decode and execution phases are 
performed in the CPU. The rest of the operations may be needed to access the 
memory. The memory cycle is defined as the time required to complete one 
memory reference. Normally, a memory cycle is k times the processor cycle 
t. The value of k relies on the processor-memory interconnection scheme 
used and speed of the memory technology. 
The CPI of an instruction type can be divided into two component terms.
These correspond to the total number of processor cycles and memory cycles
required to complete the execution of the instruction. The entire
instruction cycle may include one to four memory references (one for
instruction fetch, two for operand fetch, and one for store results) based
on the instruction type. Thus, the equation can be rewritten as follows —
T = Ic x (p + m x k) x t
where p denotes the number of processor cycles required for the instruction
decode and execution, m denotes the number of memory references required,
k denotes the ratio between memory cycle and processor cycle, Ic denotes the
instruction count, and t denotes the processor cycle time.
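
The refined CPU time equation can be evaluated directly once the five factors are known. The following sketch uses made-up values (2 processor cycles, 1 memory reference, a memory cycle 4 times the processor cycle, and a 25 ns cycle time); none of these numbers come from the text, and `cpu_time_detailed` is a hypothetical helper name.

```python
def effective_cpi(p, m, k):
    """CPI = p + m x k: processor cycles plus memory cycles per instruction."""
    return p + m * k

def cpu_time_detailed(ic, p, m, k, t_cycle):
    """T = Ic x (p + m x k) x t, in seconds."""
    return ic * effective_cpi(p, m, k) * t_cycle

# Assumed illustrative values.
ic = 100_000        # instruction count
p, m, k = 2, 1, 4   # processor cycles, memory references, memory/processor cycle ratio
t_cycle = 25e-9     # 25 ns processor cycle, i.e. a 40 MHz clock

print(effective_cpi(p, m, k))                 # 6
print(cpu_time_detailed(ic, p, m, k, t_cycle))
```

With these values each instruction costs 6 cycles on average, so a slower memory (larger k) raises the CPU time even when the processor itself is unchanged.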

Q.5. What do you understand by the performance of the pipeline ?
Explain the measures used for measuring the program.
(R.G.P.V., June 2017)

Ans. A pipeline’s performance can be measured by its throughput in
terms of millions of instructions executed per second or MIPS. Another
popular measure of performance is the number of clock cycles per
instruction or CPI. These quantities are related by the equation

CPI = f/MIPS

where f is the pipeline’s clock frequency in MHz and the values of CPI and
MIPS are average figures that can be determined experimentally by
processing suites of representative programs. The minimum value of CPI
for a single pipeline is one, making the pipeline’s maximum possible
throughput equal to f.

Space-time diagram is a useful way to visualize pipeline behaviour, which
shows the utilization of each pipeline stage as a function of time. In
general, a space-time diagram for an m-stage pipeline has the form of an
m x n grid, where n is the number of clock cycles to complete the
processing of some sequence of N instructions of interest.

Another general measure of pipeline performance is the speedup S(m)
defined by

S(m) = T(1)/T(m)

where T(m) is the execution time for some target workload on an m-stage
pipeline and T(1) is the execution time for the same workload on a similar,
nonpipelined processor. It is reasonable to assume that T(1) ≤ mT(m), in
which case S(m) ≤ m.
Measures Used for Measuring the Program — Refer to Q.4. 
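
The pipeline measures of Q.5 can be tried out with a simple idealized timing model. The model below is an assumption introduced for illustration and is not stated in the answer above: an m-stage pipeline is taken to need m + N - 1 cycles for N instructions, and the nonpipelined processor m x N cycles. All function names are hypothetical.

```python
def pipeline_cpi(f_mhz, mips):
    """CPI = f / MIPS, with f in MHz."""
    return f_mhz / mips

def cycles_pipelined(m, n_instr):
    # Assumed ideal model: m cycles to fill the pipeline, then one result per cycle.
    return m + n_instr - 1

def cycles_nonpipelined(m, n_instr):
    # Assumed: every instruction takes all m stage-times before the next one starts.
    return m * n_instr

def speedup(t1, tm):
    """S(m) = T(1) / T(m)."""
    return t1 / tm

m, n_instr = 4, 100
t1 = cycles_nonpipelined(m, n_instr)
tm = cycles_pipelined(m, n_instr)
s = speedup(t1, tm)
print(t1, tm, round(s, 2))
```

Under this model the speedup approaches m as N grows but never exceeds it, which agrees with the bound S(m) ≤ m.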
Q.6. What is pipeline CPI ? (R.GPV., June 2016) 
Ans. Refer to Q.5. 
Q.7. Explain how instruction set, memory hierarchy, compiler technology,
CPU implementation and control affect the CPU performance and justify the
effect in terms of program length, clock rate and effective CPI.
(R.G.P.V., Dec. 2016)

Or

Explain how instruction set, compiler technology, CPU implementation
and control, and cache and memory hierarchy affect the CPU performance
and justify the effects in terms of program length, clock rate and effective
CPI. (R.G.P.V., June 2013)

Ans. The instruction set, memory hierarchy, compiler technology, and
CPU implementation and control influence various performance factors of
the CPU: program length (instruction count, Ic), processor cycles needed
(p), memory reference count (m), memory access latency (k), and processor
cycle time (t).

(i) Instruction Set — It affects the instruction count (Ic), which
determines the overall program length. It also affects the processor cycles
needed (p) for execution. Since CPI = p + m x k, the instruction set also
affects the effective CPI by changing the number of processor cycles.

(ii) Memory Hierarchy — Memory hierarchy design affects the
memory access latency (k) and processor cycle time (t). Since
CPI = p + m x k and clock rate f = 1/t, the memory hierarchy affects the
effective CPI and the clock rate.

(iii) Compiler Technology — The instruction count (Ic) and memory
reference count (m) depend on compiler technology. Since Ic is the program
length and CPI = p + m x k, compiler technology affects the program length
and the effective CPI.

(iv) CPU Implementation and Control — The total processor time (p x t)
is determined by CPU implementation and control. So, the processor cycles
per instruction (p) and the processor cycle time (t) are affected by CPU
implementation and control. Since CPI = p + m x k and clock rate f = 1/t,
CPU implementation and control affects the clock rate and the effective CPI.


Table 1.1 Performance Factors Versus System Attributes

Performance factors: instruction count, Ic; average cycles per instruction,
CPI, made up of processor cycles per instruction, p, memory references per
instruction, m, and memory access latency, k; processor cycle time, t.

System Attribute                          Performance Factors Affected
Instruction-set architecture              Ic, p
Compiler technology                       Ic, m
Memory hierarchy                          k, t
Processor implementation and control      p, t


Q.8. State the CPU performance equation and discuss the factors that
affect performance. (R.G.P.V., Dec. 2017)

Ans. The CPU Performance Equation — Refer to Q.4.
The Factors that Affect Performance — Refer to Q.7.


Q.9. Differentiate various parallel computer models with examples.
Ans. Differences between various computer models with examples are
given below —


Class of 
Decompo- 
sition 


Example 
Implementation 


Class of 
Interaction 


D, Erlang, Scala, 
SALSA 


Asynchronous 
message passing 


ere Shared memory Apache Giraph, 
Apache Hama 
Communicati i 
ng | Synchronous Ada, Occam, 
message passing VerilogCSP, Go 


sequential pro- 
cesses 
Circuits 
Dataflow 


Verilog, VHDL 
Lustre, Tensor- 
Flow, Apache Flink 


Message passing 
Message passing 
Concurrent Haskell, Concurrent ML


None 


Implicit 


Functional 


Not 
specified 


Data 


Synchronous 
message passing 
Shared memory 


LogP Machine 


Cilk, CUDA, OpenMP, Threading Building Blocks, XMTC
Store and forward routing and wormhole routing schemes.


Parallel random 
access machine 


Synchronous 
message passing 
& asynchronous 
message passing. 


Q.10. How do multiprocessors differ from multicomputers ? Describe
different multiprocessor models based on shared-memory.

Or

Distinguish between multiprocessors and multicomputers based on their
structures, resource sharing and interprocess communications. Also explain
the differences among UMA, NUMA and COMA computers.
(R.G.P.V., June 2010)

Or

Explain multiprocessor and multicomputer briefly. (R.G.P.V., Dec. 2015)

Or

Distinguish between multiprocessors and multicomputers based on their
structure, resource sharing and interprocessor communication.
(R.G.P.V., June 2016, May 2018)

Ans. Multiprocessors and multicomputers are distinguished by having a
shared common memory or unshared distributed memories.

Shared-memory Multiprocessors — The three shared-memory
multiprocessor models are the uniform memory access (UMA) model, the
nonuniform memory access (NUMA) model, and the cache-only memory
architecture (COMA) model. These three models differ in how the memory and
peripheral resources are shared or distributed —

(i) The UMA Model — Fig. 1.2 shows the UMA model. In this model,
the physical memory is uniformly shared by all the processors. All
processors have the same access time to all memory words, which is why it
is known as uniform memory access. Each processor may have a private cache.
Peripherals are also shared in some manner. Because of the high degree of
resource sharing, multiprocessors are known as tightly coupled systems. The
system interconnect has the form of a common bus, a crossbar switch or a
multistage network.

Fig. 1.2 The UMA Multiprocessor Model


The UMA model is appropriate for general-purpose and time-sharing
applications by multiple users. In time-critical applications it is used to
speed up the execution of a single large program. To coordinate parallel
events, the synchronization and communication among processors are achieved
through shared variables in the common memory.

The system is known as a symmetric multiprocessor when all processors
have equal access to all peripheral devices. In this situation, all the
processors are equally able to run the executive programs, like the OS
kernel and I/O service routines. When only one or a subset of processors
are executive-capable, the system is called an asymmetric multiprocessor.
An executive or a master processor is responsible for running the operating
system and managing I/O. The rest of the processors have no I/O capability
and therefore are called attached processors (APs). The user codes are
executed by the attached processors under the supervision of the master
processor. Memory sharing among master and attached processors is still
used in both multiprocessor and attached processor configurations.

(ii) The NUMA Model — A NUMA (nonuniform-memory-access)
multiprocessor is a shared memory system. In this model, the access time
varies with the location of the memory word. Fig. 1.3 shows NUMA machine
models. All processors have their local memories. All local memories
collectively make a global address space accessible by all processors. A
processor accesses a local memory very fast. Because of the added delay
through the interconnection network, the access of remote memory attached
to other processors takes longer.

Fig. 1.3 NUMA Models for Multiprocessor Systems: (a) Shared Local
Memories, (b) A Hierarchical Cluster Model

Apart from distributed memories, globally shared memory can be added 
to a multiprocessor system. In this situation, there exist three memory access 
patterns — The local memory access is the fastest. The global memory access 
is the next. The access of remote memory is the slowest. 

Fig. 1.3(b) shows a hierarchically structured multiprocessor. The
processors are partitioned into a number of clusters. Each cluster is either
a UMA or a NUMA multiprocessor. The clusters are connected to global
shared memory modules. The whole system is regarded as a NUMA
multiprocessor. All processors belonging to the same cluster are permitted
to uniformly access the cluster shared memory modules. All clusters can
equally access the global memory. However, the access time to the cluster
memory is lower than that to the global memory.

(iii) The COMA Model — In the COMA model, a multiprocessor
uses cache-only memory. The COMA model is a special case of a NUMA
machine in which the distributed main memories are changed to caches
(fig. 1.4). No memory hierarchy is present at each processor node. All the
caches form a global address space. Remote cache access is provided by the
distributed cache directories (D in fig. 1.4). Sometimes hierarchical
directories are used to locate copies of cache blocks, based on the
interconnection network used. Initial data placement is not difficult since
data will eventually migrate to the place where it will be utilised.

Fig. 1.4 The COMA Model (each node holds a processor, a cache and a
directory, all joined by an interconnection network)

Distributed-memory Multicomputers — Multicomputers have unshared
distributed memory. Fig. 1.5 shows a distributed-memory multicomputer.
The system is composed of multiple computers, known as nodes, interconnected
by a message-passing network. Each node is an autonomous computer consisting
of a processor, local memory and sometimes attached disks or I/O
peripherals. Point-to-point static connections are provided among the nodes
by the message-passing network. All local memories are accessible only by
local processors because they are private. That is why traditional
multicomputers have been referred to as no-remote-memory-access (NORMA)
machines. However, this restriction will slowly be eliminated in future
multicomputers with distributed shared memories. Internode communication is
carried out by passing messages through the static connection network.

Fig. 1.5 Model of a Message-passing Multicomputer (processor-memory nodes
joined by a message-passing interconnection network: mesh, ring, torus,
hypercube, cube-connected cycle etc.)
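
The idea that a multicomputer node keeps its memory private and cooperates only through messages can be imitated on a single machine. The sketch below is a loose analogy using Python threads and queues, not a model of any real interconnection network; `worker_node` and the queue names are hypothetical.

```python
import queue
import threading

def worker_node(inbox, outbox):
    """A 'node' with private local memory that communicates only by messages."""
    local_memory = {"partial_sum": 100}       # never touched directly by other nodes
    value = inbox.get()                       # receive a message from the network
    local_memory["partial_sum"] += value
    outbox.put(local_memory["partial_sum"])   # send the result back as a message

# The two queues stand in for the message-passing network links.
to_worker = queue.Queue()
from_worker = queue.Queue()

node = threading.Thread(target=worker_node, args=(to_worker, from_worker))
node.start()
to_worker.put(23)                    # the "host" node sends an operand
reply = from_worker.get(timeout=5)   # and waits for the reply message
node.join()
print(reply)  # 123
```

In a shared-memory multiprocessor the host could have read `partial_sum` directly; here the only way to obtain it is to exchange messages, which is the NORMA property described above.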


Q.11. Differentiate between the COMA and NUMA. (R.G.P.V., Dec. 2010)
Ans. Refer to Q.10.

Q.12. Explain the architectural operations of SIMD and MIMD
computers. Distinguish between multiprocessor and multicomputers based
on their ... (R.G.P.V., June 2011)

Ans. Refer to Q.1 and Q.10.
Q.13. Draw and explain the organization of multiprocessor system.

Ans. The multiprocessor systems are aimed at enhancing throughput,
reliability, flexibility and availability. A multiprocessor organization
has two or more processors of approximately comparable capabilities. All
processors share access to common sets of memory modules, I/O channels and
peripheral devices. Here, the entire system is managed by a single
integrated operating system providing interactions between processors and
their programs at different levels. In addition to the shared memories and
I/O devices, each processor contains its local memory and private devices.
Interprocessor communication is provided through the shared memories or
through an interrupt network.

The organization of a multiprocessor hardware system is mainly determined
by the interconnection structure between the memories and processors. In
the past, three different interconnections have been used —

(i) Time shared common bus
(ii) Crossbar switch network
(iii) Multiport memories.

Fig. 1.6 shows a multiprocessor system.
Fig. 1.6 MIMD Multiprocessor System (shared memory modules joined to the
processors by an interprocessor-memory connection network of buses,
crossbar or multiport type, and to the I/O channels by an input-output
interconnection network)

Q.14. Explain the multivector computers (or vector supercomputers),
discuss various vector supercomputer models.

Or

Explain the vector supercomputer architecture with neat diagram.
(R.G.P.V., June 2013, 2015)

Or

Describe about vector supercomputer architecture. (R.G.P.V., June 2014)

Ans. A vector computer is mostly built on top of a scalar processor.
Fig. 1.7 shows that the vector processor is attached to the scalar processor
as an optional feature. First, program and data are loaded into the main
memory through a host computer. The scalar control unit first decodes all
instructions. If the decoded instruction is a scalar operation or a program
control operation, it is directly executed by the scalar processor using
the scalar functional pipelines. If it is decoded as a vector operation,
the instruction is sent to the vector control unit. This control unit
supervises and coordinates the flow of vector data between the main memory
and the vector functional pipelines. A number of vector functional pipelines
may be built into a vector processor.


Fig. 1.7 Vector Supercomputer Architecture (user I/O; main memory holding
data and program; scalar processor with scalar control unit and scalar
functional pipelines; vector processor with vector control unit, vector
registers and vector functional pipelines)


Vector Supercomputer Models — A register-to-register architecture is
shown in fig. 1.7. Vector registers are capable of storing the vector
operands, intermediate and final vector results. The vector functional
pipelines take operands from and keep results into the vector registers.
All vector registers can be programmed in user instructions. Each vector
register contains a component counter. This counter keeps track of the
component registers used in successive pipeline cycles. Usually each vector
register is of fixed length, for example, sixty-four 64-bit component
registers in a vector register of a Cray Series supercomputer. Usually the
numbers of vector registers and functional pipelines in a vector processor
are fixed. Therefore, both resources must be reserved in advance to avoid
resource conflicts between various vector operations.

A memory-to-memory architecture differs from a register-to-register
architecture in the use of a vector stream unit to replace the vector
registers. Vector operands and results are accessed directly from the main
memory in superwords, for example, 512 bits as in the Cyber 205.

Q.15. Define the following —
(i) Multiprocessors (ii) Multicomputers (iii) Multivector.
(R.G.P.V., June 2012)

Ans. (i) Multiprocessors — Refer to Q.13.
(ii) Multicomputers — Refer to Q.10.
(iii) Multivector — Refer to Q.14.


Q.16. Explain the SIMD computer model.

Or

Write short note on SIMD super computer. (R.G.P.V., May 2018)

Ans. An operational model of SIMD computers is shown in fig. 1.8, based
on the work of H.J. Siegel (1979). This model can be specified by a 5-tuple —

M = (N, C, I, M, R)

where, N denotes the number of processing elements (PEs) in the machine.

C denotes the set of instructions directly executed by the control unit
(CU), including scalar and program flow control instructions.

I denotes the set of instructions sent by the CU to all PEs for parallel
execution. These are arithmetic, logic, data routing, masking and other
local operations executed by each active PE over data within that PE.

M denotes the set of masking schemes, where each mask divides the set of
PEs into enabled and disabled subsets.

R denotes the set of data-routing functions, specifying different patterns
to be established in the interconnection network for inter-PE
communications.

Fig. 1.8 SIMD Computers Operational Model
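
The roles of the broadcast instruction set I, the masking schemes M and the routing functions R can be made concrete with a toy simulation. The sketch below is only a rough illustration under simplifying assumptions (a handful of PEs, each holding one number, and a cyclic shift as the only routing pattern); `simd_step` and `route_shift` are hypothetical names, not part of Siegel's model.

```python
def simd_step(pe_data, instruction, mask):
    """Broadcast one instruction; only PEs enabled by the mask execute it."""
    return [instruction(x) if enabled else x
            for x, enabled in zip(pe_data, mask)]

def route_shift(pe_data, distance=1):
    """One possible routing function: PE i receives the value held by PE i - distance."""
    n = len(pe_data)
    return [pe_data[(i - distance) % n] for i in range(n)]

pe_data = [1, 2, 3, 4]                 # N = 4 processing elements
mask = [True, False, True, False]      # a masking scheme: PEs 0 and 2 enabled

after_compute = simd_step(pe_data, lambda x: x * 10, mask)
after_route = route_shift(after_compute, distance=1)
print(after_compute)  # [10, 2, 30, 4]
print(after_route)    # [4, 10, 2, 30]
```

Disabled PEs simply keep their old value, which is how a mask splits the PEs into enabled and disabled subsets for a given instruction.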

NUMERICAL PROBLEMS



Prob.1. Consider the execution of an object code with 200000
instructions on a 40 MHz processor. The program consists of four major
types of instructions. The instruction mix and the number of cycles (CPI)
needed for each instruction type are given below based on the result of a
program trace experiment —

Instruction Type                      CPI    Instruction Mix
Arithmetic and logic                   1          60%
Load/Store with cache hit              2          18%
Branch                                 4          12%
Memory reference with cache miss       8          10%

Calculate average CPI and MIPS. (R.G.P.V., Dec. 2010)

Sol. Average CPI = 1 x 0.60 + 2 x 0.18 + 4 x 0.12 + 8 x 0.10
                 = 2.24 cycles/instruction    Ans.

MIPS rate = 40/2.24 = 17.86 MIPS    Ans.


Prob.2. A 400 MHz processor executes an object code with 2 x 10^5
instructions. The program consists of four major types of instructions. The
instruction mix and the number of cycles (CPI) needed for each instruction
type are given below —

Instruction Type                      CPI    Instruction Mix
Arithmetic and logic                   1          60%
Load/Store with cache hit              2          18%
Branch                                 4          12%
Memory reference with cache miss       8          10%

(i) Calculate the average CPI when the program is executed on a
processor.
(ii) Calculate the MIPS rate.
(R.G.P.V., June 2013)

Sol. (i) Average CPI = 1 x 0.60 + 2 x 0.18 + 4 x 0.12 + 8 x 0.10
                     = 2.24 cycles/instruction    Ans.

(ii) MIPS rate = 400/2.24 = 178.57 MIPS    Ans.

Prob.3. A 40 MHz processor was used to execute a benchmark program with the following instruction mix and clock cycle counts.

Instruction Type      Instruction Mix    Clock Cycle Count
Integer arithmetic    45%                1
Data transfer         32%                2
Floating point        15%                2
Control transfer      8%                 2

Determine the effective CPI, MIPS rate and execution time. (R.G.P.V., Dec. 2015)

Sol. Effective CPI = 1 x 0.45 + 2 x 0.32 + 2 x 0.15 + 2 x 0.08
                   = 1.55 cycles/instruction Ans.

MIPS rate = 40/1.55 = 25.8 MIPS Ans.
Execution time = Ic x CPI x t, where Ic is the instruction count and t = 1/f is the clock cycle time
               = 100 x 1.55 x 1/(40 x 10^6) = 3.875 x 10^-6 s
               = 3.875 μs Ans.


DATA AND RESOURCE DEPENDENCES, HARDWARE AND
SOFTWARE PARALLELISM, PROGRAM PARTITIONING AND
SCHEDULING, GRAIN SIZE AND LATENCY, CONTROL FLOW,
DATA FLOW AND DEMAND DRIVEN MECHANISMS


Q.17. Explain the following -
(i) Data dependence (ii) Control dependence
(iii) Resource dependence. (R.G.P.V., June 2004)

Ans. (i) Data Dependence - The data dependence indicates the ordering relationship between statements. There are five types of data dependence defined below -

(a) Flow Dependence - Flow dependence is represented as S1 → S2. It means that a statement S2 is flow dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output (variable assigned) of S1 feeds in as input to S2.

(b) Antidependence - Antidependence is denoted as S1 ↛ S2 (an arrow crossed with a bar). It means that statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1.
(c) Output Dependence - S1 o→ S2 denotes output dependence from S1 to S2. It means that two statements are output dependent if they produce (write) the same output variable.

(d) I/O Dependence - The I/O statements are read and write. I/O dependence takes place not because the same variable is included but because the same file is referenced by both I/O statements.

(e) Unknown Dependence - The dependence relation cannot be determined between two statements in the following cases -
(1) The subscript does not contain the loop index variable.
(2) The subscript of a variable is itself subscripted (i.e., indirect addressing).
(3) The subscript is nonlinear in the loop index variable.
(4) A variable appears more than once with subscripts having different coefficients of the loop variable.

When one or more of these conditions occur, a conservative assumption is to claim unknown dependence among the statements involved.

For example, consider the following instructions -

S1 : Load R1, A       /R1 ← Memory(A)/
S2 : Add R2, R1       /R2 ← (R1) + (R2)/
S3 : Move R1, R3      /R1 ← (R3)/
S4 : Store B, R1      /Memory(B) ← (R1)/

Fig. 1.9 (a) shows that S2 is flow-dependent on S1 as the variable A is passed through the register R1. S3 is antidependent on S2 due to potential conflicts in the content of register R1. S3 is output-dependent on S1 as they both modify the same register R1.
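The three register-level dependences in this example can be found mechanically from the read and write sets of each instruction. The sketch below is an illustration only: the function name `dependences` and the way each instruction is encoded as (name, reads, writes) are assumptions, and memory locations A and B are simply treated as distinct names.

```python
def dependences(instrs):
    """Return (kind, earlier, later) tuples for a straight-line code sequence.

    instrs is a list of (name, reads, writes) in program order.
    """
    found = []
    for i, (n1, reads1, writes1) in enumerate(instrs):
        live = set(writes1)  # outputs of n1 that have not been overwritten yet
        for n2, reads2, writes2 in instrs[i + 1:]:
            if live & reads2:
                found.append(("flow", n1, n2))    # n2 reads what n1 wrote
            if reads1 & writes2:
                found.append(("anti", n1, n2))    # n2 overwrites an input of n1
            if live & writes2:
                found.append(("output", n1, n2))  # both write the same location
            live -= writes2
    return found


# S1 to S4 from the example, encoded as (name, reads, writes)
program = [
    ("S1", {"A"}, {"R1"}),         # Load  R1, A
    ("S2", {"R1", "R2"}, {"R2"}),  # Add   R2, R1
    ("S3", {"R3"}, {"R1"}),        # Move  R1, R3
    ("S4", {"R1"}, {"B"}),         # Store B, R1
]

result = dependences(program)
for kind, earlier, later in result:
    print(kind, earlier, "->", later)
```

Besides the three dependences named in the text, the sketch also reports that S4 is flow-dependent on S3, since S4 stores the value that S3 placed in R1.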

Next, we consider the following I/O operations -

S1 : Read (4), A(I)       /Read array A from tape unit 4/
S2 : Rewind (4)           /Rewind tape unit 4/
S3 : Write (4), B(I)      /Write array B into tape unit 4/
S4 : Rewind (4)           /Rewind tape unit 4/

Fig. 1.9 (b) shows that the read/write statements, S1 and S3, are I/O dependent on each other as they both access the same file from tape unit 4.


(a) Dependence Graph        (b) I/O Dependence
Fig. 1.9 Data and I/O Dependences
(ii) Control Dependence - Control dependence denotes the situation where we cannot determine the order of execution of the statements before run time. For instance, conditional statements (IF in Fortran) cannot be resolved until run time. Various paths taken after a conditional branch may remove or introduce data dependence among instructions. Dependence may also be present between operations done in successive iterations of a looping method. Now, we give one loop example with and another without control-dependent iterations. The following loop has control-independent iterations -

      DO 30 I = 1, N
         A(I) = C(I)
         IF (A(I) .LT. 0) A(I) = 1
30    CONTINUE

The successive iterations of the following loop are control dependent -

      DO 20 I = 1, N
         IF (A(I - 1) .EQ. 0) A(I) = 0
20    CONTINUE

Control dependence prevents parallelism from being exploited. Compiler 
schemes are required to get around the control dependence to exploit more 
parallelism. 

(iii) Resource Dependence — Resource dependence is thought of as 
the conflicts in using shared resources, like integer units, floating-point units, 
registers and memory areas, among parallel events. If the conflicting resource 
is an ALU, it is called as ALU dependence. 

When the conflicts include workplace storage, it is known as Storage 
dependence. In the storage dependence situation, every task must work on 
independent storage locations and use protected access (like locks or monitors) 
to shared variable data. 

A sequentially coded program can be transformed into a parallel executable 
form manually by the programmer using explicit parallelism or by a compiler 
detecting implicit parallelism automatically. In both methods, the primary 
objective is the decomposition of programs. 

Program partitioning decides whether a given program can be split into 
pieces that can execute in parallel or follow a definite prespecified order of 
execution. Some programs cannot be decomposed into parallel branches 
because they are inherently sequential in nature. Detecting parallelism in
programs requires a check of the different dependence relations.


Q.18. Write short notes on the following -
(i) Hardware parallelism (ii) Software parallelism.
Or
Explain hardware and software parallelism. (R.G.P.V., June 2012)
Or
Differentiate between software and hardware parallelism. (R.G.P.V., Dec. 2015)
Or
Briefly describe hardware and software parallelism. (R.G.P.V., June 2016)

Ans. (i) Hardware Parallelism - Hardware parallelism is defined by the machine architecture and hardware multiplicity. It is a function of cost and performance tradeoffs. It shows the resource utilization patterns of simultaneously executable operations. It also specifies the peak performance of the processor resources.

In a processor, one approach to characterize the parallelism is by the number of instruction issues per machine cycle. A processor is called a k-issue processor when it issues k instructions per machine cycle. One or more machine cycles are taken by a conventional processor to issue a single instruction. These kinds of processors are known as one-issue machines, with a single instruction pipeline in the processor. Two or more instructions can be issued per machine cycle in a modern processor. The Intel i960CA, for instance, is a three-issue processor capable of issuing one memory access, one arithmetic and one branch instruction per cycle. An example of a four-issue processor is the IBM RISC/System 6000, with one memory access, one arithmetic, one branch and one floating-point instruction issued per cycle.

A multiprocessor system constructed with n k-issue processors should be capable of handling a maximum of nk threads of instructions simultaneously.

(ii) Software Parallelism - Software parallelism is defined by the control and data dependence of programs. It is a function of algorithm, programming style and compiler optimization. The degree of parallelism is shown in the program profile or in the program flow graph. The program flow graph shows the patterns of simultaneously executable operations. During the execution period, parallelism in a program varies. It often restricts the sustained performance of the processor. There are two main types of software parallelism. The first is control parallelism, which permits two or more operations to be performed simultaneously. The second is data parallelism, in which almost the same operation is performed over many data elements by many processors simultaneously.

Q.19. Explain the different levels of parallelism in program execution on computers.
Or
Define the various parallel processing levels. (R.G.P.V., June 2017)
Ans. Parallelism has been exploited at different processing levels. Five levels of program execution, representing various computational grain sizes and changing communication and control needs, are shown in fig. 1.10. The lower the level, the finer the granularity of the software processes.

(i) Instruction Level - A typical grain at this level contains less than 20 instructions, known as fine grain in fig. 1.10. Depending on individual programs, fine-grain parallelism at this level may range from two to thousands. The benefit of fine-grain computation lies in the abundance of parallelism. The exploitation of fine-grain parallelism can be assisted by an optimizing compiler, which should be capable of automatically detecting parallelism and translating the source code to a parallel form which can be identified by the run-time system.

Level 5 : Jobs or programs
Level 4 : Subprograms, job steps or related parts of a program
Level 3 : Procedures, subroutines, tasks or coroutines
Level 2 : Nonrecursive loops or unfolded iterations
Level 1 : Instructions or statements
(Grain labels in the figure, from top to bottom: coarse grain, medium grain, fine grain)

Fig. 1.10 Levels of Parallelism


(ii) Loop Level - Loop level corresponds to the iterative loop operations. A typical loop has less than 500 instructions. For pipelined execution or for lock-step execution on SIMD machines, some loop operations, if independent in successive iterations, can be vectorized. For parallel execution on MIMD machines, some loop operations can be self-scheduled. Loop-level parallelism is the most optimized program construct to execute on a parallel or a vector computer. However, recursive loops are hard to parallelize. Vector processing is mostly performed at the loop level by a vectorizing compiler. The loop level is still considered a fine grain of computation.


(iii) Procedure Level - Procedure level corresponds to medium-grain size at the task, procedural, subroutine and coroutine levels. At procedure level, a typical grain contains less than 2000 instructions. At this level, detection of parallelism is much more complicated compared to that at the finer-grain levels.

(iv) Subprogram Level - This level corresponds to the level of job steps and related subprograms. A typical grain at this level contains thousands of instructions. Job steps can overlap across various jobs. In SPMD or MPMD mode, subprograms can be scheduled for various processors, mostly on message-passing multicomputers. At this level, multiprogramming on a uniprocessor or on a multiprocessor is conducted. Parallelism at this level has been exploited by algorithm designers or programmers in the past, rather than by compilers. We have no good compilers for exploiting medium or coarse-grain parallelism at present.

(v) Job (Program) Level - This level corresponds to the parallel execution of independent jobs (programs) on a parallel computer. A typical grain at this level can be as high as tens of thousands of instructions in a single program. Such coarse-grain parallelism is practical for supercomputers with a small number of very powerful processors. Job-level parallelism is handled by the operating system and by the program loader. This level of parallelism is explored by time sharing or space sharing multiprocessors.

Q.20. What is instruction level parallelism ? (R.G.P.V., June 2015)

Ans. Refer to Q.19 (i).

Q.21. Define how we can achieve the degree of parallelism ? (R.G.P.V., Dec. 2016)

Ans. Degree of parallelism can be achieved by implementing the following design goals -
(i) Efficient parallelism management
(ii) Portability
(iii) Cheap communication
(iv) Target machine efficiency
(v) Maximum parallelism amount.

Q.22. Explain grain packing and scheduling with an example for parallel programming applications.
Or
What is grain packing, coarse-grain and fine grain ? (R.G.P.V., June 2015)
Ans. Fig. 1.11 (a) shows 17 nodes in the fine-grain program graph and fig. 1.11 (b) shows 5 nodes in the coarse-grain program graph. The coarse-grain node is achieved by grouping multiple fine-grain nodes. The fine-grain program graph corresponds to the following program -

Var a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q
Begin
1. a := 1            10. j := e x f
2. b := 2            11. k := d x f
3. c := 3            12. l := j x k
4. d := 4            13. m := 4 x l
5. e := 5            14. n := 3 x m
6. f := 6            15. o := n x i
7. g := a x b        16. p := o x h
8. h := c x d        17. q := p x q
9. i := d x e
End

Memory reference (i.e., data fetch) operations are nodes from 1 to 6. Each node takes six cycles to fetch from memory and one cycle to address. The rest of the nodes from 7 to 17 are CPU operations, each of which needs two cycles to finish. The coarse-grain nodes have larger grain sizes ranging from 4 to 8 after packing, as depicted in fig. 1.11 (b). The node (A, 8) is obtained by combining the nodes (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1) and (11, 2) in fig. 1.11 (a). The grain size of node A is 8. It is the summation of all grain sizes (1 + 1 + 1 + 1 + 1 + 1 + 2 = 8) being combined.

Legends :
(n, s) = (node, grain size)
(x, i) = (input, delay)
(u, k) = (output, delay)

(a) Fine-grain Program Graph Before Packing        (b) Coarse-grain Program Graph After Packing
Fig. 1.11 A Program Graph Before and After Grain Packing

To do grain packing, first apply fine grain to achieve a high degree of parallelism. Then multiple fine-grain nodes are packed into a coarse-grain node in order to eliminate unnecessary communication delays or decrease the overall scheduling overhead. For execution, all fine-grain operations within a single coarse-grain node are allocated to the same processor. Coarse-grain partition requires less IPC (interprocessor communication) compared to fine-grain partition. Therefore, grain packing provides a tradeoff between parallelism and scheduling/communication overhead.

As the communication delay is contributed primarily by interprocessor delays rather than by delays within the same processor, internal delays among fine-grain operations within the same coarse-grain node are negligible. The selection of the optimal grain size is meant to get the shortest schedule for the nodes on a parallel computer system.

(a) Fine Grain [Fig. 1.11 (a)]        (b) Coarse Grain [Fig. 1.11 (b)]
Fig. 1.12 Scheduling of the Fine-grain and Coarse-grain Programs

Two multiprocessor schedules are shown in fig. 1.12 with respect to the fine-grain versus coarse-grain program graphs in fig. 1.11. The fine-grain schedule is longer, i.e., 42 time units, as more communication delays are involved, as depicted in fig. 1.12 (a). The coarse-grain schedule is shorter because communication delays within the same coarse-grain node are eliminated after grain packing.


Q.23. Distinguish between medium grain and fine grain multicomputer in their architectures and programming requirements. (R.G.P.V., June 2013)

Ans. Refer to Q.22.

Q.24. What is grain size and latency ? (R.G.P.V., Dec. 2015)
Or
What do you mean by grain size and latency ? (R.G.P.V., Dec. 2016)

Ans. Grain Size - A measure of the amount of computation involved in a software process is called grain size or granularity. The easiest measure is to count the number of instructions in a grain (program segment). Grain size specifies the basic program segment selected for parallel processing. The commonly used grain sizes are fine, medium or coarse, based on the processing levels involved.

Latency - A time measure of the communication overhead incurred between machine subsystems is called latency. For instance, the time needed by a processor to access the memory is called the memory latency. The synchronization latency is the time needed for two processes to synchronize with each other.
Q.25. Explain in detail the various performance metrics for communication. (R.G.P.V., Dec. 2016)

Ans. (i) Communication Bandwidth - It is controlled by the interconnection network and the processor, not by any aspect of the communication mechanism. The resources of nodes are locked up first in communication.
(ii) Communication Latency - Communication latency may be calculated as -

Communication latency = Sender overhead + Time of flight + Transmission time + Receiver overhead

In communication, latency should be as low as possible. Sender overhead and receiver overhead depend on the hardware and software overhead, i.e., on the mechanism of communication and its implementation. Time of flight is always fixed and transmission time depends on the message size and the channel bandwidth. If the latency is not hidden then it directly affects the performance by locking up resources or making the processor wait.

(iii) Communication Latency Hiding - It is a very important aspect of a communication mechanism. To hide the latency, communication is overlapped with either local computation or any other form of computation. It cannot be measured in the same way as communication latency. Latency hiding can be measured by measuring the processing time on machines which have different support structures and the same communication latency.

Q.26. Explain the following -
(i) Computational granularity
(ii) Communication latency. (R.G.P.V., June 2014)

Ans. (i) Computational Granularity - Refer to Q.24.

(ii) Communication Latency - Refer to Q.25.

We can achieve better performance of a computer system by balancing granularity and latency. Various latencies are attributed to the machine architecture, the communication patterns involved and the implementing technology. The architecture and technology affect the design choices for latency tolerance between subsystems. In fact, latency imposes a limiting factor on the scalability of the machine.
Q.27. What is data flow graph ? (R.G.P.V., Dec. 2015)

Ans. A data flow graph is similar to a dependence graph or program graph. The only difference is that data tokens are passed around the edges in a data flow graph.

Q.28. What do you understand by control-flow and data-flow computers ?
Or
Compare data-flow and control-flow computers. (R.G.P.V., June 2013)
Compare 
Ans. The shared memory is used by control-flow computers to hold program instructions and data objects. Many instructions update variables in the shared memory. There may be side effects of the execution of one instruction on other instructions because memory is shared. The side effects prevent parallel processing from taking place in many cases. Because of the use of the control-driven mechanism, a uniprocessor computer is inherently sequential. However, with the help of parallel language constructs or parallel compilers, control flow can be made parallel.

The execution of an instruction in a data-flow computer is driven by data availability instead of being guided by a program counter. Theoretically, whenever operands become available any instruction should be ready for execution. In this way, the instructions in a data-driven program are not ordered. Data are directly held inside instructions instead of being stored in a shared memory.

Computational results (data tokens) are given directly between instructions. The data produced by an instruction will be duplicated into a number of copies and forwarded directly to all needy instructions. Once data tokens are consumed by an instruction, they will no longer be present for reuse by other instructions. This data-driven scheme does not need shared memory, program counter and control sequencer. However, it needs special mechanisms to detect data availability, to match data tokens with needy instructions, and to enable the chain reaction of asynchronous instruction executions. No side effects are produced, since there is no memory sharing between instructions. Asynchrony indicates the need for handshaking or token-matching operations. A pure data-flow computer exploits fine-grain parallelism at the instruction level. If the data-driven mechanism could be implemented cost-effectively, massive parallelism would be possible.

Q.29. Explain demand-driven flow mechanism.

Ans. The computation in a reduction machine is triggered by the demand for an operation's result.


Let us consider the evaluation of a nested arithmetic expression a = ((b + 1) x c - (d ÷ e)). The data-driven computation selects a bottom-up approach, beginning from the innermost operations b + 1 and d ÷ e, then proceeding to the x operation and at last to the outermost operation -. Such a computation is known as eager evaluation. A demand-driven computation selects a top-down approach by first demanding the value of a, which triggers the demand for evaluating the next-level expressions (b + 1) x c and d ÷ e, which in turn triggers the demand for evaluating b + 1 at the innermost level. The results are then sent back to the nested demander in the reverse order before a is evaluated. A demand-driven computation corresponds to lazy evaluation, since operations are run only when their results are needed by some other instruction. This approach matches naturally with the functional programming concept.
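The two evaluation orders can be traced with a small Python sketch. It assumes the nested expression a = ((b + 1) x c - (d ÷ e)); the function names `eager` and `lazy`, the `trace` lists and the sample operand values are all illustrative. The lazy version wraps each subexpression in a lambda (a thunk) that runs only when its result is demanded.

```python
def eager(b, c, d, e, trace):
    """Data-driven (bottom-up): each operation runs once its operands exist."""
    t1 = b + 1
    trace.append("b+1")
    t2 = d / e
    trace.append("d/e")
    t3 = t1 * c
    trace.append("(b+1)*c")
    a = t3 - t2
    trace.append("a")
    return a


def lazy(b, c, d, e, trace):
    """Demand-driven (top-down): a demand for a triggers demands for its parts."""
    def demand(label, thunk):
        trace.append("demand " + label)
        return thunk()

    return demand(
        "a",
        lambda: demand("(b+1)*c", lambda: demand("b+1", lambda: b + 1) * c)
        - demand("d/e", lambda: d / e),
    )


eager_trace, lazy_trace = [], []
eager_value = eager(3, 2, 8, 4, eager_trace)
lazy_value = lazy(3, 2, 8, 4, lazy_trace)
print(eager_value, eager_trace)
print(lazy_value, lazy_trace)
```

Both orders produce the same value; only the sequence in which work is started differs, innermost first for the eager trace and outermost first for the lazy trace.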


Q.30. What are the differences between string reduction and graph reduction machines ?

Ans. Each demander in a string reduction model receives a separate copy of the expression for its own evaluation. A long string expression is reduced to a single value in a recursive manner. Each reduction step contains an operator followed by an embedded reference to demand the corresponding input operands. The operator is suspended while its input arguments are being evaluated. When all the arguments have been replaced by literal values, the expression is said to be fully reduced.

The expression in a graph reduction model is expressed as a directed graph. The graph is reduced by evaluation of branches or subgraphs. Different parts of a graph or subgraphs are reduced or evaluated in parallel upon demand. Each demander is provided a pointer to the result of the reduction. All references to that graph are manipulated by the demander.

Manipulation of the graph depends on sharing the arguments by means of pointers. The traversal of the graph and reversal of the references are carried out until constant arguments are met. This continues until the value of the expression is obtained and a copy is sent back to the original demanding instruction.


Q.31. Compare control-flow, data-flow, and reduction computers in terms of the program flow mechanism used. (R.G.P.V., June 2010, 2014)

Ans. A comparison among control-flow, data-flow, and reduction computer architectures is given in table 1.2. The degree of explicit control decreases from control-driven to demand-driven to data-driven.
Table 1.2 Data-flow, Reduction and Control-flow Computers

Machine Model : Control-flow (Control-driven)
   Basic definition : Conventional computation; token of control specifies when a statement should be executed.
   Advantages : (i) Full control (ii) Complex data and control structures are easy to implement.
   Disadvantages : (i) Less efficient (ii) Difficult in programming (iii) Difficult in preventing runtime error.

Machine Model : Data-flow (Data-driven)
   Basic definition : Eager evaluation; statements are executed when all of their operands are available.
   Advantages : (i) Very high potential for parallelism (ii) High throughput (iii) Free from side effects.
   Disadvantages : (i) Time lost waiting for unneeded arguments (ii) High control overhead (iii) Difficult in manipulating data structures.

Machine Model : Reduction (Demand-driven)
   Basic definition : Lazy evaluation; statements are executed only when their result is required for another computation.
   Advantages : (i) Only required instructions are executed (ii) High degree of parallelism (iii) Easy manipulation of data structures.
   Disadvantages : (i) Does not support sharing of objects with changing local state (ii) Time required to propagate demand tokens.

Moreover, control tokens are used in control-flow computers and reduction machines, respectively. The merits and demerits of the data-flow and reduction machine models depend on research findings instead of on extensive operational experience.

Q.32. Comment on the advantages and disadvantages in control complexity, potential for parallelism and cost effectiveness of the above computer models. (R.G.P.V.)

Ans. Refer to Q.31.

NUMERICAL PROBLEMS


Prob.4. Consider the execution of the following code segment consisting of five statements. Use Bernstein's conditions to detect the maximum parallelism embedded in this code. Justify the portions that can be executed in parallel and the remaining portions that must be executed sequentially. (R.G.P.V., Dec. 2016)

S1 : C = D x E
S2 : M = G + C
S3 : A = B + C
S4 : C = L + M
S5 : F = G + E

Sol. Assume that each statement requires one step to execute. The dependences among the statements are shown in fig. 1.13.

Legend : Data Flow Dependence, Resource Dependence, Antidependence, Output Dependence
Fig. 1.13 Data and Resource Dependence Graph

According to Bernstein's conditions, only the following pairs can be executed in parallel -
(i) S1 || S5 (ii) S2 || S3 (iii) S2 || S5 (iv) S5 || S3 (v) S4 || S5

S2 || S3 || S5 is possible because S2 || S3, S3 || S5 and S5 || S2 are all possible.

So, the parallel execution requires only three steps, as shown in fig. 1.14.

Fig. 1.14 Parallel Execution in Three Steps, Assuming Three Adders are Available per Step
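Bernstein's conditions say that two statements can run in parallel when neither writes anything the other reads and they do not write the same variable. The sketch below applies that test to every pair of the five statements. The function name `can_run_in_parallel` and the encoding of each statement as an (input set, output set) pair are illustrative choices, not part of the original solution.

```python
from itertools import combinations


def can_run_in_parallel(p, q):
    """Bernstein's conditions on two (inputs, outputs) pairs."""
    in_p, out_p = p
    in_q, out_q = q
    return (not (in_p & out_q)        # q does not overwrite an input of p
            and not (in_q & out_p)    # p does not overwrite an input of q
            and not (out_p & out_q))  # p and q do not write the same variable


# (input set, output set) of each statement in Prob.4
statements = {
    "S1": ({"D", "E"}, {"C"}),
    "S2": ({"G", "C"}, {"M"}),
    "S3": ({"B", "C"}, {"A"}),
    "S4": ({"L", "M"}, {"C"}),
    "S5": ({"G", "E"}, {"F"}),
}

parallel_pairs = [
    (a, b)
    for a, b in combinations(statements, 2)
    if can_run_in_parallel(statements[a], statements[b])
]
print(parallel_pairs)
```

The check only looks at data; whether three statements can actually share a step also depends on how many functional units are available, which is what fig. 1.14 assumes.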


Prob.5. Analyse the data dependency among the following statements in a given program -
S1 : Load R1, 1024
S2 : Load R2, M(10)
S3 : Add R1, R2
S4 : Store M(1024), R1
S5 : Store M(R2), 1024
(i) Draw the dependence graph to show all dependencies.
(ii) Are there any resource dependencies, if only one copy of each functional unit is available in the CPU ? (R.G.P.V., Dec. 2010)
Or
Analyse the data dependencies among the following statements -
S1 : Load R1, 1024           /R1 ← 1024/
S2 : Load R2, M(10)          /R2 ← Memory(10)/
S3 : Add R1, R2              /R1 ← (R1) + (R2)/
S4 : Store M(1024), R1       /Memory(1024) ← (R1)/
S5 : Store M(R2), 1024       /Memory(64) ← 1024/
Note that (Ri) means the content of register Ri and Memory(10) contains 64 initially.
(i) Draw a dependence graph to show all the dependencies.
(ii) Are there any resource dependencies if only one copy of each functional unit is available in the CPU ?
(R.GP.V., May 2 01) 
Sol. (i) Fig. 1.15 shows the dependence graph with all dependencies.

(ii) S4 and S5 need to use the same store unit in accessing the memory. Therefore they are potentially storage dependent.

Fig. 1.15


UER 


| 
RKS, DYNAMIC ea | 


Q.33. What is interconnection network ?

Ans. In systems with many components, communication may be controlled by a subsystem called an interconnection network. The function of the interconnection network is to establish dynamic communication paths among the components via the buses under its control.


with different network topologies. ( 
Or
Compare and comment on static interconnection networks in terms of node degree, network diameter and bisection width. (R.G.P.V., Dec. 2010)


Ans. Static networks use direct links, which are fixed once built. This type of network is appropriate for building computers where the communication patterns are predictable or implementable with static connections.

(i) Linear Array - It is a one-dimensional network where N nodes are connected by N - 1 links in a line as shown in fig. 1.16. The terminal nodes have degree 1 and the internal nodes have degree 2. The diameter is N - 1, which is very long for large N. The bisection width b is equal to 1. The linear array is the simplest connection topology. The structure is not symmetric and creates a communication inefficiency when N is very large.

Fig. 1.16 Linear Array
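The node degree and diameter quoted for the linear array can be confirmed by brute force on a small instance. The sketch below builds an adjacency list and finds the diameter with breadth-first search. The helper names `linear_array` and `diameter` and the choice N = 8 are illustrative assumptions.

```python
from collections import deque


def linear_array(n):
    """Adjacency list of n nodes connected in a line by n - 1 links."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def diameter(adj):
    """Longest shortest path between any two nodes, found by BFS from each node."""
    longest = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        longest = max(longest, max(dist.values()))
    return longest


net = linear_array(8)
degrees = [len(neighbours) for neighbours in net.values()]
print(degrees, diameter(net))
```

The same `diameter` function works on any adjacency list, so other static topologies can be checked in the same way once their links are listed.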

(ii) Ring and Chordal Ring - As shown in fig. 1.17 (a), a ring is formed by connecting the two terminal nodes of a linear array with one extra link. A ring has a constant node degree of 2. By increasing the node degree from 2 to 3 or 4, we get two chordal rings as shown in fig. 1.17 (b) and (c). One and two extra links are added to produce the two chordal rings. Generally, the more links added, the higher the node degree and the shorter the network diameter.

(iii) Barrel Shifter - The barrel shifter is obtained from the ring by adding extra links from each node to those nodes having a distance equal to an integer power of 2, as shown in fig. 1.17 (d). It means that node i is connected to node j if |j - i| = 2^r for some r = 0, 1, ..., n - 1, and the network size is N = 2^n. This type of barrel shifter has a diameter of D = n/2 and a node degree of d = 2n - 1.

Clearly, the connectivity in the barrel shifter is increased over that of any chordal ring of lower node degree. For N = 16, the barrel shifter has a diameter of 2 and a node degree of 7. However, the barrel shifter complexity is still much lower compared to that of the completely connected network in fig. 1.17 (e).

(b) Chordal Ring of Degree 3        (c) Chordal Ring of Degree 4
(d) Barrel Shifter        (e) Completely Connected
Fig. 1.17

(iv) Tree and Star - Fig. 1.18 (a) shows a binary tree of 15 nodes in four levels. Generally, a completely balanced binary tree having k levels should have N = 2^k - 1 nodes. The diameter is 2(k - 1) and the maximum node degree is 3. The binary tree is a scalable architecture with a constant node degree. However, the diameter is very long. The bisection width is equal to 1.

Fig. 1.18 (b) shows that the star is a two-level tree with a high node degree of d = N - 1 and a small constant diameter of 2. The star architecture has been used in systems with a centralized supervisor node. The bisection width is equal to ⌊N/2⌋.

(a) Binary Tree        (b) Star
Fig. 1.18

(v) Fat Tree - Fig. 1.19 shows a binary fat tree. As we ascend from the leaves to the root, the channel width of a fat tree increases. The fat tree is just like a real tree where branches get thicker toward the root.

Fig. 1.19 Binary Fat Tree

In using the conventional binary tree, there are some problems. Therefore, to eliminate these problems, the fat tree was proposed.

(vi) Mesh and Torus - Fig. 1.20 (a) shows a 4 x 4 mesh network. Generally, a k-dimensional mesh with N = n^k nodes has an interior node degree of 2k and a network diameter of k(n - 1). The torus network shown in fig. 1.20 (b) can be viewed as another variant of mesh with an even shorter diameter. Generally, an n x n binary torus has a node degree of 4 and a diameter of 2⌊n/2⌋. The bisection width is equal to 2n.

(a) Mesh        (b) Torus
Fig. 1.20

(vii) Systolic Array - The data processing circuits called systolic arrays are closely related conceptually to pipelines. Systolic arrays are built by interconnecting a set of similar data-processing cells in a uniform fashion. From cell to cell, data words flow synchronously, with each cell making a small step in the overall operation of the array. The data are not fully processed until the end results emerge from the array's boundary cells. Therefore, a one-dimensional systolic array is a type of pipeline with similar stages. There is a structure in a two-dimensional systolic array not unlike the divider array, but its cells are sequential instead of combinational.

Generally, a systolic array allows data to flow through the cells in different directions at once. There must be buffering within the cells to isolate different sets of operands from one another as in pipelines. Systolic processors have been designed to implement different complex arithmetic operations like convolution, matrix multiplication and solution techniques for linear equations.

(a) One-dimensional Linear Array        (b) Two-dimensional Square Array
(c) Two-dimensional Hexagonal Array     (d) Triangular Array
Fig. 1.21 Various Systolic Array Configurations

ba 
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-bound algorithms VLSI systolic arrays can assum 
ous systolic array configurations are shown i 
fig. 1.21. These computations make the basis of signal and image processin n 
matrix arithmetic, combinatorial, database algorithms. Systolic technique 
received a great deal of attention recently because of their simplicity and Stone 
appeal to intuition. Although, the implementation of systolic arrays ona Vs 
chip has several practical constraints. 

(viii) Hypercubes - The hypercube or binary n-cube multiprocessor structure is a loosely coupled system consisting of $N = 2^n$ processors interconnected in an n-dimensional binary cube. Each processor forms a node of the cube. Although it is customary to refer to each node as having a processor, in effect it contains not only a CPU but also local memory and an I/O interface. Each processor has direct communication paths to n other neighbour processors. These paths correspond to the edges of the cube. There are $2^n$ distinct n-bit binary addresses that can be assigned to the processors. Each processor address differs from that of each of its n neighbours by exactly one bit position.

Fig. 1.22 shows the hypercube structure for n = 1, 2 and 3. A one-cube structure has n = 1 and $2^n = 2$. It contains two processors interconnected by a single path. A two-cube structure has n = 2 and $2^n = 4$. It contains four nodes interconnected as a square. A three-cube structure has eight nodes interconnected as a cube. An n-cube structure has $2^n$ nodes with a processor residing in each node. Each node is assigned a binary address in such a way that the addresses of two neighbours differ in exactly one bit position. For example, the three neighbours of the node with address 100 in a three-cube structure are 000, 110 and 101. Each of these binary numbers differs from address 100 by one bit value.

One-cube    Two-cube    Three-cube
Fig. 1.22 Hypercube Structures for n = 1, 2, 3

Routing messages through an n-cube structure may take from one to n links from a source node to a destination node. For example, in a three-cube structure, node 000 can communicate directly with node 001. It must cross at least two links to communicate with 011 (for example, from 000 to 001 to 011).
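The addressing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `neighbours` and `route` are hypothetical, and `route` picks one particular shortest path by correcting differing address bits from the lowest bit upward, which is just one of several valid choices.

```python
def neighbours(node: int, n: int) -> list:
    """Return the n neighbours of `node` in an n-cube.

    Each neighbour address is obtained by flipping exactly one address bit.
    """
    return [node ^ (1 << bit) for bit in range(n)]


def route(src: int, dst: int, n: int) -> list:
    """Return one shortest path from src to dst as a list of node addresses.

    Assumption: differing bits are corrected from bit 0 upward. The number
    of links used equals the number of bit positions in which src and dst differ.
    """
    path = [src]
    current = src
    for bit in range(n):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit
            path.append(current)
    return path


def as_bits(nodes, n):
    """Format node addresses as n-bit binary strings for printing."""
    return [format(node, "0{}b".format(n)) for node in nodes]


print(as_bits(neighbours(0b100, 3), 3))      # neighbours of node 100
print(as_bits(route(0b000, 0b011, 3), 3))    # a two-link route from 000 to 011
```

The length of the returned path minus one is the number of links crossed, which ranges from one to n for distinct nodes.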


Unit - I 


Q.35. Differentiate between the binary tree and fat tree interconnection network. (R.G.P.V., Dec. 2010)

Ans. Refer to Q.34 (iv) and (v).
Q.36. Give a brief introduction to dynamic interconnection network.

Ans. Dynamic interconnection networks use configurable paths and do not have a processor associated with each node. Processors are connected dynamically, using dynamic connections that can be set up as required. Switches or arbiters are used along the connecting paths to offer the dynamic connectivity instead of using fixed connections. Dynamic connection networks include bus systems, multistage interconnection networks (MIN) and crossbar switch networks, in increasing order of cost and performance. The performance is indicated by the network bandwidth, data transfer rate, network latency and communication patterns supported. The price tags of dynamic interconnection networks are attributed to the cost of the wires, switches, arbiters and connectors required.


Q.37. Discuss bus systems (digital buses) interconnection network.

Ans. A collection of wires and connectors for data transactions among processors, memory modules and peripheral devices attached to it is called a bus system. Only one transaction at a time takes place on the bus, between a source (i.e., master) and a destination (i.e., slave). In case of multiple requests, the bus arbitration logic must be able to allocate or deallocate the bus, servicing the requests one at a time. That is why the digital bus is also known as a contention bus or a time-sharing bus among multiple functional modules. A bus system is cheaper than the other two dynamic connection networks, but it has a limited bandwidth.

A bus-connected multiprocessor system is shown in fig. 1.23. The system bus acts as a communication path between the processors or I/O subsystem and the memory modules or secondary storage devices like disks and tape units. The system bus is mostly implemented on a backplane of a printed circuit board. Other boards for processors, memories or device interfaces are plugged into the backplane board through connectors or cables.


Fig. 1.23 A Bus-connected Multiprocessor System

To address the memory, the active or master devices, such as processors or the I/O subsystem, produce requests. The passive or slave devices, such as memories or peripherals, reply to the requests. The common bus is used on a time-sharing basis. The important busing issues are the coherence protocol, transaction processing, interrupt handling and bus arbitration.

Q.38. Explain bus arbitration algorithm in the time shared bus organization of MIMD system. (R.G.P.V., June)

Ans. The bus arbitration algorithms are usually implemented in hardware and permit the arbitration for a bus cycle to be overlapped with the previous transfer. These algorithms are discussed below -

(i) The Static Priority Algorithm - Many digital buses in use today assign unique static priorities to the requesting devices. When requests for the bus come from multiple devices concurrently, the device having the highest priority is granted access to it. This approach is implemented using a scheme known as daisy chaining. In this scheme all devices are effectively assigned static priorities according to their locations along a bus grant control line. The device nearest to a central bus controller is given the highest priority (see fig. 1.24). A common request line, BRQ, is used to make requests. If the acknowledge signal (SACK) specifies that the bus is idle, the central bus control unit sends a bus grant signal (BGT).

Fig. 1.24 Static Daisy Chain Implementation of a System Bus

The first device that has issued a bus request and receives the BGT signal prevents the latter's propagation. This sets the bus-busy flag in the controller, and the device assumes bus control. On completion, it resets the bus-busy flag in the controller, and if other requests are outstanding a new BGT signal is generated. This technique is used by the DEC PDP-11 Unibus and is incorporated by the Motorola MC68000 processor.

(ii) The Fixed Time Slice Algorithm - This algorithm splits the available bus bandwidth into fixed-length time slices that are then sequentially offered to each device in a round-robin manner. Should the chosen device elect not to use its time slice, the time slice remains unused by any device. This scheme, known as fixed time slicing (FTS) or time division multiplexing (TDM), is used by Digital's parallel communications link, which also permits a flexible assignment of available time slices to the devices. This technique is used in synchronous buses, where all devices are synchronized to a common clock.

The service provided to each device by the bus in the FTS technique does not depend on that device's position or identity on the bus. Techniques with this characteristic are called symmetric. In this scheme, each of the m devices is given one out of every m time slices at fixed intervals. Since no priority is given to any device, symmetric bus-arbitration algorithms optimally load-balance all bus requests. Further, FTS provides a bounded maximum wait time to the devices. However, it suffers a high average wait time and hence a lower bus utilization.

FTS incurs a substantially higher standard deviation of all wait times than does the static priority scheme when the bus is not heavily loaded.

(iii) The Dynamic Priority Algorithm - The following dynamic priority algorithms allow the load-balancing features of symmetric algorithms like fixed time slicing to be achieved without incurring the penalty of high wait times. The devices are given unique priorities and compete to access the bus, but the priorities are dynamically changed to provide every device an opportunity to access the bus. If the algorithm used to permute the priorities favours no particular device, the bus requests are load-balanced. Dynamic priorities get rid of the inefficiency inherent in the fixed time slice technique of allocating full time slices to the devices before requests are placed. The two algorithms for dynamically permuting priorities are the least recently used (LRU) and the rotating daisy chain (RDC).

The LRU algorithm assigns the highest priority to the requesting device that has not used the bus for the longest interval. This is done by reassigning priorities after each bus cycle. The second dynamic priority algorithm generalizes the daisy chain implementation of static priorities. In the daisy chain technique all devices are given static and unique priorities based on their positions on a bus-grant line emanating from a central controller.

Fig. 1.25 Rotating Daisy Chain Implementation of a System Bus 


No central controller is present in the RDC scheme, and the bus-grant line is connected from the last device back to the first in a closed loop, as shown in fig. 1.25. The device that is granted access to the bus works as the bus controller for the following arbitration. For a given arbitration, each device's priority is determined by that device's distance along the bus-grant line from the device currently serving as bus controller. The priority of the latter device is the lowest. Therefore, the priorities change dynamically with each bus cycle.
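The difference between the static and the rotating daisy chain can be sketched as follows. This is a simplified model, not a description of any real bus: devices are assumed to be numbered 0 to m - 1 along the bus-grant line, device 0 is assumed to sit nearest the central controller in the static case, and the function names are hypothetical.

```python
from __future__ import annotations


def static_daisy_chain(requests: set[int]) -> int | None:
    """Grant the bus to the requester nearest the central controller.

    Assumption: a lower device number means a position closer to the controller.
    """
    return min(requests) if requests else None


def rotating_daisy_chain(requests: set[int], controller: int, m: int) -> int | None:
    """Grant the bus to the requester closest downstream of the current controller.

    The distance runs from 1 (next device on the closed loop) to m (the
    controller itself), so the current controller has the lowest priority.
    """
    if not requests:
        return None

    def distance(device: int) -> int:
        return (device - controller - 1) % m + 1

    return min(requests, key=distance)


def simulate_rdc(requests: set[int], m: int, cycles: int) -> list:
    """Run several arbitrations with the same pending requests every cycle."""
    controller = m - 1          # assumed start, so that device 0 is first in line
    grants = []
    for _ in range(cycles):
        winner = rotating_daisy_chain(requests, controller, m)
        grants.append(winner)
        if winner is not None:
            controller = winner  # the winner acts as controller next time
    return grants


everyone = {0, 1, 2, 3}
print([static_daisy_chain(everyone) for _ in range(6)])  # same device every cycle
print(simulate_rdc(everyone, m=4, cycles=6))             # grants rotate around the loop
```

Under constant requests from every device, the static scheme keeps serving the same device, while the rotating scheme visits each device in turn.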


(iv) The First Come First Served Algorithm (FCFS) - In this scheme, requests are serviced in the order received. This algorithm is symmetric, as it favours no particular processor or device on the bus. Thus, it load-balances the bus requests. Under the condition of fixed service times by the central resource, FCFS provides the smallest possible average wait time and standard deviation of all wait times. In fact, FCFS is the optimal bus-arbitration algorithm with respect to these performance measures.

Unfortunately, FCFS is hard to implement for two reasons. Any implementation of FCFS must provide a mechanism to keep track of the arrival order of all pending bus requests. It is also possible that two bus requests arrive within such a small interval that their relative ordering cannot be correctly distinguished. Therefore, any implementation can only approximate the behaviour of FCFS. Despite these problems in realizing an implementation, it is useful to measure the performance of FCFS as an indicator of the best possible performance that a bus-arbitration algorithm can achieve with respect to the above criteria.


Polling and independent requesting are two other techniques used in bus-control algorithms. In a bus controller using polling, the bus grant signal (BGT) of the static daisy chain implementation is replaced by a set of $\lceil \log_2 n \rceil$ polling lines, as depicted in fig. 1.26. The set of poll lines is connected to each of the devices. On a bus request, the controller sequences through the device addresses by using the poll lines. When a device $D_i$ which requested access identifies its address, it raises the SACK line to indicate bus busy.


Fig. 1.26 Polling Implementation of a System Bus 


The bus-control unit acknowledges by ending the polling process, and $D_i$ gains access to the bus. The access is maintained until the device lowers the SACK line. Note that the position of a device in the polling sequence determines its priority. In the independent requesting technique, a separate bus request ($BRQ_i$) and bus grant ($BGT_i$) line are connected to each device i sharing the bus, as depicted in fig. 1.27. This requesting technique allows the implementation of LRU, FCFS and a variety of other allocation algorithms.




Fig. 1.27 Independent Request Implementation of a System Bus 


Q.39. What are switch modules ?

Ans. An x x y switch module contains x inputs and y outputs. A binary switch is a 2 x 2 switch module where x = y = 2. In theory, x and y need not be equal. However, in practice, x and y are often selected as integer powers of 2, i.e. $x = y = 2^k$ for some $k \ge 1$.

Table 1.3 lists different commonly used switch module sizes. An input can be connected to one or more of the outputs. However, conflicts must be avoided at the output terminals. In other words, one-to-one and one-to-many mappings are permitted, but many-to-one mappings are not permitted because of conflicts at the output terminal.

If only one-to-one mappings, i.e. permutations, are permitted, then the module is called an n x n crossbar switch. For instance, a 2 x 2 crossbar switch can connect two possible patterns - straight or crossover. Generally, an n x n crossbar can have n! permutations. Table 1.3 lists the numbers of legitimate connection patterns for switch modules of different sizes.

Table 1.3 Switch Modules and Legitimate States

Module Size    Legitimate States    Permutation Connections
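The two counts named in the table heading can be computed directly. The sketch below assumes that in a legitimate state every output is connected to exactly one of the n inputs (so one-to-many is allowed and many-to-one is not), which gives $n^n$ states, of which n! are permutations. The module sizes used and the function names are illustrative choices, not values taken from the table.

```python
from math import factorial


def legitimate_states(n: int) -> int:
    """Number of connection patterns when each of n outputs picks one of n inputs."""
    return n ** n


def permutation_connections(n: int) -> int:
    """Number of one-to-one (permutation) connection patterns of an n x n module."""
    return factorial(n)


print("Module size  Legitimate states  Permutation connections")
for n in (2, 4, 8):
    print("{0} x {0}        {1:<17}  {2}".format(
        n, legitimate_states(n), permutation_connections(n)))
```

For the 2 x 2 module this gives four states, which matches the straight, crossover, upper broadcast and lower broadcast settings of a binary switch, and two permutations (straight and crossover).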


Q.40. Explain the crossbar switch organization for interconnection of multiprocessors.
Or
Describe briefly the term crossbar switches associated with a multi-processor system. (R.G.P.V., Dec. 2003)


Or 
Explain the crossbar switch. (R.GPV., Dec. 201; 


Ans. If the number of buses in a time-shared bus system is expanded, a point is reached at which there exists a separate path available for each memory unit, as depicted in fig. 1.28. The interconnection network is then known as a nonblocking crossbar.

Fig. 1.28 Crossbar (Nonblocking) Switch System Organization for Multiprocessors

The crossbar switch has complete connectivity with respect to the memory modules, as there exists a separate bus associated with each memory module. Thus, the maximum number of transfers that can take place simultaneously is limited by the number of memory modules and the bandwidth-speed product of the buses instead of by the number of paths available.

The significant characteristics of a system using a crossbar interconnection matrix are the extreme simplicity of the switch-to-functional unit interfaces and the ability to support simultaneous transfers for all memory units. Offering these features needs major hardware capabilities in the switch. Not only must each cross point be able to switch parallel transmissions, but it must also be able to resolve multiple requests for access to the same memory module happening during a single memory cycle. Usually these conflicting requests are controlled on the basis of a predetermined priority. The consequence of the inclusion of such a capability is that the hardware needed to implement the switch can become quite large and complicated. Although very large scale integration (VLSI) can decrease the size of the switch, it will have only a minor effect on its complexity.

In a crossbar switch or multiported device, conflicts take place when two or more concurrent requests are made to the same destination device. Assume that there are 16 destination devices (memory modules) and 16 requestors (processors). The implementation to be explained can also be used for a processor to device connection. An example functional design of a crossbar switch element or multiported memory for one module is shown in fig. 1.29. The switch is composed of arbitration logic and multiplexer modules. Each processor produces a memory module request signal (REQ) to the arbitration unit, which chooses the processor with the highest priority. The selection is done with a priority encoder. An acknowledge signal (ACK) is sent by the arbitration logic to the chosen processor. After the processor receives the ACK, it initiates its memory operation.

Fig. 1.29 Functional Structure of a Crosspoint in a Crossbar Network

The multiplexer module multiplexes data, addresses of words within the module and control signals from the processor to the memory module with the help of a 16-to-1 multiplexer. The multiplexer is controlled by the encoded number of the chosen processor. This code is produced by the priority encoder within the arbitration logic.
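The arbitration and multiplexing steps for one memory module can be sketched as below. This is a behavioural model only: it assumes that a lower processor number means a higher priority (the text does not fix the priority order), and the names `priority_encode`, `crosspoint` and the contents of the per-processor signal records are hypothetical.

```python
from __future__ import annotations


def priority_encode(requests: list[bool]) -> int | None:
    """Return the number of the highest-priority requesting processor.

    Assumption: processor 0 has the highest priority and processor 15 the lowest.
    """
    for index, active in enumerate(requests):
        if active:
            return index
    return None


def crosspoint(requests: list[bool], processor_lines: list[dict]):
    """Model one crosspoint: arbitrate among REQ signals, then multiplex.

    Returns the ACK vector (True only for the chosen processor) and the
    data/address/control lines forwarded to the memory module.
    """
    winner = priority_encode(requests)
    acks = [index == winner for index in range(len(requests))]
    forwarded = None if winner is None else processor_lines[winner]
    return acks, forwarded


# Sixteen processors; processors 3 and 9 request the same module in one cycle.
requests = [False] * 16
requests[3] = requests[9] = True
lines = [{"addr": 0x100 + p, "rd_wr": "RD"} for p in range(16)]

acks, forwarded = crosspoint(requests, lines)
print("ACK sent to processor", acks.index(True))
print("lines forwarded to memory module:", forwarded)
```

Processor 9 receives no ACK in this cycle and would have to repeat its request in a later memory cycle.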
This scheme was utilized to implement the processor-memory switch for the C.mmp, which contains 16 processors and 16 memory modules. The switch is composed of 16 sets of cross points from one processor port to the 16 memory ports and another 16 sets of cross points from one memory port to the 16 processor ports. Theoretically, enhancement of the system is limited only by the size of the switch matrix, which can often be modularly increased within initial design or other engineering limits. An effect of VLSI on the crossbar interconnection system is the feasibility of designing crossbar matrices for a larger capacity than initially needed and equipping them only for the present needs. Expansion would then be facilitated, because all that is needed is the addition of the missing cross points.

A natural extension of the crossbar switch concept is to use a similar switch on the device side of the I/O processor or channel to provide the flexibility required in access to the input-output devices, as shown in fig. 1.30. The hardware needed for the implementation is quite different and not nearly so complicated, since controllers and devices are normally designed to identify their own unique addresses. The effect is the same as if there were a primary bus connected with each I/O channel and crossbuses for each controller or device.

Fig. 1.30 A Crossbar Organization for Inter-processor-memory-I/O Connections

There is potential for the highest bandwidth and system efficiency in the crossbar switch. However, due to its complexity and cost, it may not be cost-effective for a large multiprocessor system. The reliability of the switch is problematic, but it can be enhanced by segmentation and redundancy within the switch. Generally, it is quite easy to split the system to logically isolate malfunctioning units. There are a number of examples of multiprocessor systems using the crossbar interconnection systems. Some of these are the C.mmp and the S-1 multiprocessor systems.

Q.41. Define the following terms for various system interconnect architectures -
(i) Network diameter
(ii) Bisection width
(iii) Node degree
(iv) Non-blocking networks
(v) Digital buses.
(R.G.P.V., June 2010)

Ans. (i) Network Diameter - The maximum shortest path between any two nodes is the diameter D of a network. The number of links traversed is used to measure the path length. The network diameter specifies the maximum number of distinct hops between any two nodes, thus giving a figure of communication merit for the network. Therefore, from a communication point of view, the network diameter should be as small as possible.

(ii) Bisection Width - When a given network is cut into two equal parts, the minimum number of edges (channels) along the cut is called the channel bisection width b. In the case of a communication network, each edge corresponds to a channel with w bit wires. Then the wire bisection width B is equal to bw. This parameter B provides the wiring density of a network. The channel width (in bits) is w = B/b when B is fixed. Thus, the bisection width gives an indicator of the maximum communication bandwidth along the bisection of a network. All other cross-sections are bounded by the bisection width.

(iii) Node Degree - The node degree d is the number of edges (links or channels) incident on a node. In the case of unidirectional channels, the number of channels into a node is the in degree, and that out of a node is the out degree. Then the sum of the two is the node degree. The node degree provides the number of I/O ports needed per node, and therefore the cost of a node. Hence, the node degree should be kept constant and as small as possible to decrease cost. A constant node degree is very much preferred to obtain modularity in building blocks for scalable systems.

(iv) Non-blocking Networks - Refer to Q.40.

(v) Digital Buses - Refer to Q.37.

Q.42. Compare and comment on static and dynamic interconnection network in terms of node degree, network diameter and bisection width. (R.G.P.V., May 2018)

Ans. Refer to Q.41.

Q.43. Explain multiport memory organization in detail.
Or
Explain the multiport memory. (R.G.P.V., Dec. 2017)

Ans. Fig. 1.31 shows that a multiport memory system is the result if the control, switching and priority arbitration logic that is distributed over the crossbar switch matrix is instead distributed at the interfaces to the memory modules.
Fig. 1.31 Multiport-memory Organization without Fixed Priority Assignment

Multiport memory is used in both uniprocessor and multiprocessor system organizations. Basically, memory-access conflicts are resolved by allocating permanently designated priorities at each memory port. The system is then configured at each installation to offer the appropriate priority access to the different memory modules for each functional unit, as depicted in fig. 1.32.

All of the ports are usually electrically and operationally identical except for the priority associated with each. The ports are just a row of identical cable connectors. The flexibility possible in configuring the system makes it possible to designate portions of memory as private to some processors, I/O units or combinations thereof, as depicted in fig. 1.33.


Fig. 1.32 Multiport-memory System with Assignment of Port Priorities

Fig. 1.33 Multiport-memory Organization


In fig. 1.33, memory modules 1 and 4 are private to processors 1 and 2, respectively. This type of system has certain benefits in increasing protection against unauthorized access and may allow the storage of recovery routines in memory areas that are not susceptible to modification by other processors. There are also drawbacks in system recovery if the other processors are not able to access control and status information in a memory block associated with a faulty processor.

If a fully connected topology is utilized, the multiport-memory system organization supports nonblocking access to the memory. It also allows the exploitation of interleaved memory addresses for access by a single processor, because each word access is a separate operation. Interleaving may degrade memory performance by increasing the number of memory-access conflicts that occur when all processors cycle through all memory following a sequence of consecutive addresses. If there is a failure, interleaving also results in the loss of more than one module of memory. Examples of multiport memory systems are the Univac 1100/90 and the IBM System 370/168.


Q.44. Describe briefly the multistage network for interconnection of multiprocessors.
Or
Write short note on multistage and combining networks. (R.G.P.V., June 2014)
Or
Write short note on multistage connection networks. (R.G.P.V., Dec. 2014)

Ans. Multistage interconnection networks (MINs) are used in both MIMD and SIMD computers. Fig. 1.34 shows a generalized multistage network. Many x x y switches are used in each stage. Fixed interstage connections are utilized between the switches in adjacent stages. The switches can be set dynamically to establish the required connections between the inputs and outputs.




Fig. 1.34 A Generalized Structure of a Multistage Interconnection Network 
(MIN) 

Various classes of MINs differ in the switch modules used and in the type of interstage connection (ISC) patterns used. The simplest switch module would be the 2 x 2 switch (x = y = 2). The ISC patterns mostly used are perfect shuffle, butterfly, multiway shuffle, crossbar, cube connection, etc. Multistage networks are used to form larger multiprocessor systems. A special class of multistage networks that resolves conflicts automatically within the network is the combining network. The combining network was developed for the NYU Ultracomputer.
Q.45. Discuss in brief omega network.

Ans. The omega network is a type of multistage network. The four possible connections of the 2 x 2 switches used in constructing the omega network are shown in fig. 1.35 (a) to (d). Fig. 1.35 (e) shows a 16 x 16 omega network. Four stages of 2 x 2 switches are required. There exist 16 inputs on the left and 16 outputs on the right. The perfect shuffle over 16 objects is the ISC pattern.

From the input end to the output end, the stages are labeled from 0 to $\log_2 n - 1$. Data routing is handled by inspecting the destination code in binary. A 2 x 2 switch at stage i connects the input to the upper output when the ith high-order bit of the destination code is a 0. Otherwise, the input is directed to the lower output.




(a) Straight (b) Lower Broadcast (c) Upper Broadcast (d) Crossover 


(e) A 16 x 16 Omega Network 


Fig. 1.35 The Use of 2 x 2 Switches and Perfect Shuffle as an Interstage 
Connection Pattern to Construct a 16 x 16 Omega Network 
Fig. 1.36 (a) and (b) show two switch settings with respect to permutations $\pi_1$ = (0, 7, 6, 4, 2) (1, 3) (5) and $\pi_2$ = (0, 6, 4, 7, 3) (1, 5) (2), respectively.

For the implementation of $\pi_1$, the switch settings are shown in fig. 1.36 (a), which maps 0 → 7, 7 → 6, 6 → 4, 4 → 2, 2 → 0, 1 → 3, 3 → 1, 5 → 5. Consider, for example, the routing of a message from input 001 to output 011. There is no conflict in any of the switch settings required to implement the permutation $\pi_1$. Now consider implementing the permutation $\pi_2$ in the 8-input omega network in fig. 1.36 (b). There exist conflicts in switch settings in three switches identified as F, G and H. The conflicts taking place at F are due to the desired routings 000 → 110 and 100 → 111. Both inputs to switch F must be connected to the lower output because both destination addresses have a leading bit 1. One request must be rejected to resolve the conflict. Likewise, there exist conflicts at switch G between 011 → 000 and 111 → 011 and at switch H between 101 → 001 and 011 → 000. Broadcast is used from one input to two outputs at switches I and J, which is permitted if the hardware is built to have the four legitimate states shown in fig. 1.35 (a) to (d). The above example shows that not all permutations can be implemented in one pass through the omega network.

The omega network is a type of blocking network. In case of blocking, one can set up the conflicting connections in several passes. For the permutation $\pi_2$, for example, we can connect 000 → 110, 001 → 101, 010 → 010, 101 → 001, 110 → 100 in the first pass and 011 → 000, 100 → 111, 111 → 011 in the second pass. Generally, an n-input omega network can implement $n^{n/2}$ permutations in a single pass if 2 x 2 switch boxes are used. There are a total of n! permutations. For n = 8, only $8^4/8!$ = 4096/40320 = 10.16% of all permutations are implementable in a single pass through an 8-input omega network. All others will create blocking and demand up to three passes to be realized. Generally, a maximum of $\log_2 n$ passes are required for an n-input omega network. Blocking is not a preferred characteristic in any multistage network, because it may reduce the effective bandwidth.
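The destination-tag routing rule and the blocking behaviour described above can be checked with a small model. The sketch assumes the usual omega wiring (a perfect shuffle in front of every stage) and the rule stated in Q.45 (bit 0 selects the upper output, bit 1 the lower output). Two messages conflict when they need the same switch output at the same stage. The function names are hypothetical, and the model does not reproduce the switch labels F, G and H of fig. 1.36.

```python
from itertools import permutations


def route_links(src: int, dst: int, n_bits: int) -> list:
    """Return the link (line number) a message occupies after each stage."""
    mask = (1 << n_bits) - 1
    position = src
    links = []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the line number left by one bit.
        position = ((position << 1) | (position >> (n_bits - 1))) & mask
        # The switch output is chosen by the stage-th high-order destination bit.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        position = (position & ~1) | bit
        links.append(position)
    return links


def is_conflict_free(mapping: dict, n_bits: int) -> bool:
    """True if no two messages of the mapping need the same link at any stage."""
    routes = [route_links(src, dst, n_bits) for src, dst in mapping.items()]
    for stage in range(n_bits):
        used = [route[stage] for route in routes]
        if len(set(used)) != len(used):
            return False
    return True


def count_single_pass_permutations(n_bits: int) -> int:
    """Count the permutations realizable in one pass, by exhaustive enumeration."""
    size = 1 << n_bits
    return sum(
        is_conflict_free(dict(enumerate(perm)), n_bits)
        for perm in permutations(range(size))
    )


pi_1 = {0: 7, 7: 6, 6: 4, 4: 2, 2: 0, 1: 3, 3: 1, 5: 5}
pi_2 = {0: 6, 6: 4, 4: 7, 7: 3, 3: 0, 1: 5, 5: 1, 2: 2}
print("pi_1 conflict free:", is_conflict_free(pi_1, 3))
print("pi_2 conflict free:", is_conflict_free(pi_2, 3))
```

Calling `count_single_pass_permutations(3)` enumerates all 8! permutations of the 8-input network, so its result can be compared with the $8^4$ = 4096 figure quoted above.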


(a) Permutation $\pi_1$ = (0, 7, 6, 4, 2)(1, 3)(5) Implemented on an Omega Network Without Blocking

(b) Permutation $\pi_2$ = (0, 6, 4, 7, 3)(1, 5)(2) Blocked at Switches Marked F, G and H


Fig. 1.36 Two Switch Settings of an 8 x 8 Omega Network Built 
with 2 x 2 Switches 


Q.46. Distinguish between omega and crossbar networks.
(R.GP.V., Dec. 2014) 


Ans. Refer to Q.45 and Q.40. 


Q.47. What is interconnection network ? Explain different
interconnection network architectures comparing their architectural features. 
(R.GP.V., June 2011) 

Or 
Compare static interconnection networks and dynamic interconnection 


networks. (R.GP.V., June 2012) 
Or 


Distinguish between static and dynamic connection networks. 
(R.GP.V., Dec. 2014) 


Or 


Explain the static and dynamic interconnection networks. 
(R.GP.V, June 2015) 


Or 
Distinguish between static interconnection network and dynamic
interconnection network. (R.G.P.V., Dec. 2016)


Ans. Interconnection Network — Refer to Q.33. 
Static Interconnection Network and its Types — Refer to Q.34. 


Dynamic Interconnection Network and its Types — Refer to Q.36, 
Q.37, Q.40 and Q.44. 
Q.48. Explain the following terminologies associated with SIMD computers -
(i) Cube routing function
(ii) Mesh connected Illiac network
(iii) Shuffle exchange and omega networks.
(R.G.P.V., Dec. 2016)

Ans. (i) Cube Routing Function - The cube network can be implemented as either a multistage network or a recirculating network. The cube routing functions specify the interconnection of the N processing elements of an n-dimensional cube network. Here, the processing elements are the vertices of the cube, and the lines connect vertices whose addresses differ in exactly one bit position. For an n-cube, $n = \log_2 N$, where N is the number of processing elements.


Fig. 1.37 Cube

Fig. 1.38 Cube Network Recirculating Using Routing Function


Let $A = (a_{n-1} \ldots a_2 a_1 a_0)$ be the binary sequence representing a vertex address, for $0 \le A \le N - 1$. The n routing functions that define the n-dimensional cube network are

$$C_i(a_{n-1} \ldots a_1 a_0) = a_{n-1} \ldots a_{i+1}\, \bar{a}_i\, a_{i-1} \ldots a_0 \qquad \text{for } i = 0, 1, 2, \ldots, n - 1$$

The interconnection of processing elements corresponding to routing functions $C_0$, $C_1$ and $C_2$ is shown in fig. 1.38.
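Since $C_i$ only complements bit i of the address, it can be written as a single exclusive-OR. The following sketch, with the hypothetical function name `cube`, lists the partner of every processing element under $C_0$, $C_1$ and $C_2$ for N = 8.

```python
def cube(i: int, address: int) -> int:
    """Cube routing function C_i: complement bit i of the PE address."""
    return address ^ (1 << i)


n = 3                  # dimensions of the cube
N = 1 << n             # number of processing elements, N = 2**n

for i in range(n):
    pairs = [(format(a, "03b"), format(cube(i, a), "03b")) for a in range(N)]
    print("C_{}:".format(i), pairs)
```

Applying the same function twice returns the original address, so each $C_i$ pairs the processing elements two by two.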


(ii) Mesh Connected Illiac Network — In this network, every $PE_i$
(the i-th processing element) can send data to any one of $PE_{i+1}$,
$PE_{i-1}$, $PE_{i+r}$ and $PE_{i-r}$, where $r = \sqrt{N}$.

Simply, the Illiac network is formed by using four routing functions

(a) $C_{+r}(i) = (i + r) \bmod N$
(b) $C_{-r}(i) = (i - r) \bmod N$
(c) $C_{+1}(i) = (i + 1) \bmod N$
(d) $C_{-1}(i) = (i - 1) \bmod N$

where $0 \le i \le N - 1$.

For example, with N = 16 and r = 4, the routing functions in cycle notation are

$C_{+1}$ = (0 1 2 ... 15)
$C_{-1}$ = (15 14 ... 2 1 0)
$C_{+4}$ = (0 4 8 12) (1 5 9 13) (2 6 10 14) (3 7 11 15)
$C_{-4}$ = (12 8 4 0) (13 9 5 1) (14 10 6 2) (15 11 7 3)

The mesh connected Illiac network is shown in fig. 1.39.

[Fig. 1.39 Mesh Connected Illiac Network (N = 16 PEs)]
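The four Illiac routing functions and their cycle notation can be checked with a small sketch. The helper names `illiac_neighbors` and `cycles` are hypothetical, and N is assumed to be a perfect square.

```python
import math


def illiac_neighbors(i: int, N: int) -> dict:
    """Return the four PEs that PE_i can reach in one routing step."""
    r = math.isqrt(N)
    if r * r != N:
        raise ValueError("N must be a perfect square so that r = sqrt(N)")
    return {
        "+1": (i + 1) % N,
        "-1": (i - 1) % N,
        "+r": (i + r) % N,
        "-r": (i - r) % N,
    }


def cycles(step: int, N: int) -> list:
    """Cycle notation of the permutation i -> (i + step) mod N."""
    seen, result = set(), []
    for start in range(N):
        if start in seen:
            continue
        cycle, j = [], start
        while j not in seen:
            seen.add(j)
            cycle.append(j)
            j = (j + step) % N
        result.append(tuple(cycle))
    return result


print(illiac_neighbors(0, 16))  # neighbours of PE_0 for N = 16, r = 4
print(cycles(4, 16))            # the four cycles of the +r routing function
```

For the -r function the sketch lists each cycle starting from its smallest element, so (12 8 4 0) appears as (0 12 8 4); both notations describe the same cycle.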


(iii) Shuffle Exchange and Omega Network — The shuffle exchange network is
based on two routing functions, shuffle (S) and exchange (E). Let A be the
address of a processing element such that

$$A = a_{n-1} \ldots a_1 a_0$$

Then

$$S(A) = S(a_{n-1} \ldots a_1 a_0) = a_{n-2} \ldots a_1 a_0 a_{n-1}$$

and

$$E(A) = E(a_{n-1} \ldots a_1 a_0) = a_{n-1} \ldots a_1 \bar{a}_0$$

where $0 \le A \le N - 1$ and $n = \log_2 N$.

Omega Network — Refer to Q.45.
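The shuffle and exchange functions defined in part (iii) amount to two bit operations. The sketch below assumes addresses are plain integers of n bits; the function names are illustrative only.

```python
def shuffle(a: int, n: int) -> int:
    """Perfect shuffle S(A): rotate the n-bit address left by one position."""
    msb = (a >> (n - 1)) & 1                 # a_{n-1}, which moves to the right end
    return ((a << 1) & ((1 << n) - 1)) | msb


def exchange(a: int) -> int:
    """Exchange E(A): complement the least significant bit a_0."""
    return a ^ 1


n = 3  # N = 8 processing elements
for a in range(1 << n):
    print(f"A={a:03b}  S(A)={shuffle(a, n):03b}  E(A)={exchange(a):03b}")
```

Applying the shuffle n times returns every address to itself, since the n bits have then been rotated once around.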


INSTRUCTION SET ARCHITECTURE, CISC SCALAR
PROCESSORS, RISC SCALAR PROCESSORS, VLIW
ARCHITECTURE


Q.1. What is instruction set ? 

Ans. The instruction set defines the machine instructions or primitive 
commands that a software programmer can use for programming the machine. 
Instruction set complexity is attributed to the data formats, instruction formats, 
addressing modes, opcode specifications, flow-control mechanisms and 
general-purpose registers used. 


Q.2. Write short note on complex instruction sets. 


Ans. In the early days of computing, most computer families started
with a rather simple instruction set, mainly because of the high
cost of hardware. Over the last 30 years, however, the cost of hardware has
decreased and the cost of software has increased. In addition, the semantic
gap between computer architecture and high level language (HLL) features
has widened.

As a consequence, more and more functions have been built into the
hardware, making the instruction set large and complex. In the 1960s
and 1970s, the growth of instruction sets was also encouraged by the popularity
of microprogrammed control. For special-purpose applications in some
processors, even user-defined instruction sets were implemented by means of
microcode.


Typically, a CISC instruction set consists of nearly 120 to 350 instructions
employing variable instruction/data formats, utilizes a small set of 8 to 24
general-purpose registers and executes many memory reference operations
based on more than a dozen addressing modes. In a CISC architecture, a large
number of HLL statements are implemented directly in hardware/firmware. This
may enhance execution efficiency, simplify compiler development and permit
an extension from scalar instructions to symbolic and vector instructions.


Q.3. Write short note on reduced instruction sets.


Ans. We started with RISC instruction sets and, in the 1980s, moved to
CISC instruction sets. After 20 years of using CISC processors, computer users
began to reevaluate the performance relationship between instruction set
architecture and available hardware/software technology.

Generally, a RISC instruction set includes not more than 100 instructions
with a fixed instruction format of 32 bits and uses only three to five addressing
modes. Instructions are often register-based, and only load and store instructions
are used to access memory. With hardwired control, most instructions execute
in one cycle, and a large register file is used to enhance fast context switching
among multiple users. The whole processor is implemented on a single VLSI
chip because of the reduction in complexity of the instruction set. The resulting
advantages include a lower CPI and a higher clock rate, which result in higher
MIPS ratings.
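The link between CPI, clock rate and MIPS rating can be made concrete with the standard relation MIPS = f / (CPI x 10^6), where f is the clock rate in Hz. The sketch below is only a worked illustration; the two sample processors and their numbers are hypothetical and are not measurements.

```python
def mips_rating(clock_hz: float, cpi: float) -> float:
    """MIPS rate = clock rate / (CPI * 10**6)."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return clock_hz / (cpi * 1e6)


# Hypothetical figures chosen only to show the trend described above.
cisc_like = mips_rating(clock_hz=40e6, cpi=4.0)    # lower clock rate, higher CPI
risc_like = mips_rating(clock_hz=100e6, cpi=1.25)  # higher clock rate, lower CPI

print(f"CISC-like example: {cisc_like:.1f} MIPS")
print(f"RISC-like example: {risc_like:.1f} MIPS")
```

A MIPS rating compares instruction throughput only. Because a RISC program usually needs more instructions than the equivalent CISC program, execution time must also take the instruction count into account.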

Q.4. Compare the instruction set architectures in RISC and CISC
processors in terms of control mechanisms, addressing modes, clock rate
and expected CPI, register file and cache design, and instruction set.

Or

Compare the instruction set architecture in RISC and CISC processors
in terms of instruction formats, addressing modes, and cycles per instruction.
(R.G.P.V., Dec. 2014)


Ans. The comparison between RISC and CISC processors, in terms of
control mechanisms, addressing modes, clock rate and expected CPI, register
file and cache design, and instruction set, is given in table 2.1.


Table 2.1 Features of RISC and CISC Architectures

S. No. | Features | Reduced Instruction Set Computer (RISC) | Complex Instruction Set Computer (CISC)
(i) | CPU control mechanism | Mostly hardwired, without control memory | Mostly microcoded, employing control memory (ROM), but modern CISC also utilizes hardwired control
(ii) | Addressing modes | Limited to 3-5 | More than 12
(iii) | Clock rate and expected CPI | 50-150 MHz in 1993, with one cycle for almost all instructions and an average CPI less than 1.5 | 33-50 MHz in 1992, with a CPI between 2 and 15
(iv) | General-purpose registers (GPRs) and cache design | 32-192 GPRs, mostly with a split data cache and instruction cache | 8-24 GPRs, mostly with a unified cache for data and instructions; recent designs also use split caches
(v) | Instruction-set size and instruction formats | Small set of instructions with a fixed format of 32 bits, mostly register-based | Large set of instructions with variable formats (16-64 bits per instruction)


RISC processors use 32-bit instructions that are predominantly register-
based. The memory-access cycle is partitioned into pipelined access operations
involving the use of working registers and caches, with a few simple addressing
modes. Employing a large register file and separate I and D-caches benefits
internal data forwarding and removes unnecessary storage of intermediate results.
For most RISC instructions, the CPI is reduced to 1 with hardwired control.

In a CISC processor, on the other hand, the large number of instructions is the
result of employing variable-format instructions (for integer, floating-point and
vector data) and of using more than a dozen different addressing modes. Besides,
with only a few GPRs available, many more instructions access the memory for
operands. As a result of the long microcode used to handle the execution
of some complex instructions, the CPI is high.


Q.5. Describe the architectural distinctions between RISC and CISC processors. 


Ans. The architectural distinction between the RISC and CISC processors 
is illustrated in fig. 2.1. Since future processors can be designed with features 
from both types, some of the differences may disappear. 


[Fig. 2.1 (a) RISC Architecture: hardwired control unit, data path, separate instruction cache and data cache, main memory; (b) CISC Architecture: control unit with microprogrammed control memory, instruction and data path, unified cache, main memory]

In a RISC architecture, separate data and instruction caches are employed
with different access paths, although exceptions do exist. In a CISC
architecture, a unified cache is used for holding both data and instructions.
Thus, they must share the same data/instruction path. However, CISC
processors may also use split caches. Hardwired control is used in most RISC
processors, while microprogrammed control is found in traditional CISC
processors. Therefore, control memory is required in earlier CISC processors,
which may significantly slow down instruction execution. Modern CISC
processors may, however, also use hardwired control. Thus, hardwired
control and split caches are not exclusive to RISC machines. Using hardwired
control will reduce the CPI effectively to one cycle per instruction when
pipelining is performed correctly. Some CISC processors, like the i586 and
MC68040, also use hardwired control and split caches.


Q.6. Discuss and compare the characteristics of CISC and RISC
architectures. (R.G.P.V., Dec. 2015)
Or
Discuss and compare the characteristics of RISC and CISC architectures.
(R.G.P.V., June 2016)
Ans. Refer to Q.4 and Q.5.


Q.7. What is RISC attributes and discuss the advantages of RISC in
comparison with other architecture ? (R.G.P.V., June 2015)
Ans. RISC attributes are the features and technology that characterize a
RISC processor. Some examples of RISC processors are the Intel i860,
SPARC, MIPS R3000, IBM RS/6000 etc. A special class of RISC processors
is the superscalar processors, which allow multiple instructions to be issued
simultaneously during each cycle.
The advantages of RISC in comparison to other instruction architectures
are —
(i) A RISC instruction set contains fewer than 100 instructions,
all in a fixed 32-bit format.
(ii) Only three to five simple addressing modes are used.
(iii) A large register file is used to improve fast context switching
among multiple users.
(iv) In RISC, most of the instructions are executed in one cycle
with hardwired control.
(v) The RISC processor is implemented on a single VLSI chip,
due to the reduction in instruction-set complexity.
(vi) The RISC/superscalar processor benefits from a higher clock rate
and a lower CPI, which lead to higher MIPS ratings.


Q.8. Discuss CISC scalar processors in brief.

Ans. A scalar processor runs with scalar data. The simplest scalar processor
runs integer instructions with the help of fixed point operands. More capable
scalar processors run both integer and floating-point operations. Within the same
CPU, a modern scalar processor may possess both an integer unit and a floating
point unit. A CISC scalar processor, based on a complex instruction set, is built
either on a single chip or with multiple chips mounted on a processor board.

In the ideal case, the performance of a CISC scalar processor is similar to that
of the base scalar processor. In practice, however, the processor is often
underpipelined. The main causes of the underpipelined situations are resource
conflicts, logic hazards, branch penalties and data dependence among instructions.


Q.9. Discuss RISC scalar processors briefly.


Ans. Generic RISC processors are called scalar RISC because they are designed to
issue one instruction per cycle, identical to the base scalar processor. Theoretically,
both CISC and RISC scalar processors should perform about the same when they
run with equal program length and with the same clock rate. However, these two
assumptions are not always valid, because the architecture influences the density
and quality of code produced by compilers. In both processor architectures,
instruction-level parallelism is exploited by pipelining. A CISC processor relies
less on a good compiler than a RISC processor does.
The RISC design gains its power by pushing some of the less frequently used operations
into software. However, neither RISC nor CISC can perform as well as designed
without a low CPI, a high clock rate and good compilation support.


Q.10. Differentiate between CISC scalar processors and RISC scalar
processors. (R.G.P.V., Dec. 2017)
Ans. Refer to Q.8 and Q.9.


Q.11. Distinguish between scalar RISC and superscalar RISC in terms
of instruction issue, pipeline architecture and processor performance.
(R.G.P.V., June 2010, 2013, 2014)


Ans. A RISC scalar processor can be improved by using a superscalar
architecture or a vector architecture. Processors that execute one instruction
per cycle are called scalar processors. Only one instruction is issued per cycle,
and only one completion of instruction is expected from the pipeline per cycle.

A superscalar processor uses multiple instruction pipelines. This means
that multiple instructions are issued per cycle and multiple results are produced
per cycle. A vector processor runs vector instructions on arrays of data.
Each vector instruction therefore involves a string of repeated operations, which
is ideal for pipelining with one result per cycle.
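The difference in issue rate can be illustrated with an idealized timing model. The model is an assumption added here and is not part of the answer above: N independent instructions run on a k-stage pipeline with no stalls, taking k + N - 1 base cycles on a scalar machine and k + (N - m)/m base cycles on an m-issue superscalar machine. The function names are hypothetical.

```python
from math import ceil


def scalar_cycles(n_instr: int, k: int) -> int:
    """Base cycles for n_instr independent instructions on a k-stage scalar pipeline."""
    return k + n_instr - 1


def superscalar_cycles(n_instr: int, k: int, m: int) -> int:
    """Base cycles when m instructions are issued per cycle (idealized, no stalls)."""
    return k + ceil((n_instr - m) / m)


def speedup(n_instr: int, k: int, m: int) -> float:
    """Speedup of the m-issue superscalar machine over the scalar machine."""
    return scalar_cycles(n_instr, k) / superscalar_cycles(n_instr, k, m)


# Example: 12 independent instructions, 4 pipeline stages, issue degree 3.
print(scalar_cycles(12, 4))          # 15 base cycles
print(superscalar_cycles(12, 4, 3))  # 7 base cycles
print(round(speedup(12, 4, 3), 2))
```

In real programs data dependences and branches limit the usable parallelism, so the measured speedup falls below the issue degree m.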


Q.12. Discuss in detail the VLIW processor architecture. 


Or 

Draw VLIW architecture. (R.GP.V, June 2012) 
Or 

What is the basic concept of VLIW approach ? (R.GP.V., June 2016) 
Or


Write short note on VLIW architecture. (R.GP.V., May 2018) 




Ans. The VLIW (very long instruction word) architecture has been
generalized from two well-established concepts —
(i) Horizontal microcoding (ii) Superscalar processing.

A VLIW machine contains instruction words of hundreds of bits in length.
In a VLIW processor, multiple functional units are employed concurrently, as
shown in fig. 2.2 (a), and all functional units share a common large register
file. The operations to be run simultaneously by the functional units are
synchronized in a VLIW instruction of, for example, 256 or 1024 bits per
instruction word, as implemented in the Multiflow computer models.


[Fig. 2.2 VLIW Processor Architecture and its Pipeline Operations: (a) processor with main memory, register file and functional units (load/store, FP add, FP multiply, branch, integer ALU); (b) VLIW execution with degree m = 4, showing the Ifetch, Decode and Execute stages against time in base cycles]


The VLIW concept is borrowed from horizontal microcoding. The different
fields of the long instruction word carry the operation codes to be dispatched
to different functional units. Programs written in traditional short
instruction words (32 bits) must be compacted together to form the VLIW
instructions. This code compaction is performed by a compiler that can predict
branch outcomes using run-time statistics or elaborate heuristics.


Pipelining in VLIW Processors — Fig. 2.2 (b) illustrates the execution of
instructions by an ideal VLIW processor, where each instruction defines multiple
operations. VLIW machines behave very similarly to superscalar machines. In a
VLIW architecture, data movement and instruction parallelism are specified at compile
time. Therefore, synchronization and run-time resource scheduling are completely
removed. A VLIW processor can be considered an extreme case of a superscalar
processor in which all unrelated or independent operations are already
synchronously compacted together in advance. The CPI of a VLIW processor can
be even lower than that of a superscalar processor.


VLIW Opportunities — Random parallelism among scalar operations is
exploited in a VLIW architecture. The success of a VLIW processor relies mainly
on the efficiency of code compaction. The VLIW architecture is totally
incompatible with any conventional general-purpose processor.


By explicitly encoding parallelism in the long instruction, a VLIW processor
can remove the need for hardware or software to detect parallelism. The
benefit of the VLIW architecture is the simplicity of its instruction set and hardware
structure. In scientific applications, the VLIW processor can potentially perform
well, but the architecture may not perform satisfactorily in general-purpose
applications. The VLIW architecture has not entered the mainstream of computers
because of its lack of compatibility with traditional hardware and software.


Q.13. What are the limitations of VLIW ? (R.G.P.V., June 2015)


Ans. Refer to Q.12. 


Q.14. Explain the difference between superscalar and VLIW architecture
in terms of hardware and software requirements.
(R.G.P.V., June 2013, Dec. 2014, June 2017)


Ans. The differences between superscalar and VLIW architectures are
as follows —

(i) Decoding of VLIW instructions is easier than decoding of
superscalar instructions.

(ii) A superscalar machine is object-code-compatible with a large family
of nonparallel machines. In contrast, VLIW machines exploiting different
amounts of parallelism would need different instruction sets.

(iii) The code density of the superscalar machine is better when the available
instruction-level parallelism is less than that exploitable by the VLIW machine.
This is because the fixed VLIW format includes bits for nonexecutable
operations, whereas the superscalar processor issues only executable instructions.
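Point (iii) can be illustrated with a toy model of a fixed-format VLIW word. The slot names follow the functional units named in Fig. 2.2; the `nop` filler, the helper names and the sample operations are assumptions made only for this sketch.

```python
# One slot per functional unit, as named in Fig. 2.2 (a).
SLOTS = ("load_store", "fp_add", "fp_mul", "branch", "int_alu")


def pack_word(ops: dict) -> tuple:
    """Build one VLIW word; slots without an operation are padded with 'nop'."""
    unknown = set(ops) - set(SLOTS)
    if unknown:
        raise ValueError(f"no such functional unit: {sorted(unknown)}")
    return tuple(ops.get(slot, "nop") for slot in SLOTS)


def utilization(words: list) -> float:
    """Fraction of slots that carry executable operations."""
    total = len(words) * len(SLOTS)
    used = sum(op != "nop" for word in words for op in word)
    return used / total


# Two hypothetical cycles with little available parallelism.
program = [
    pack_word({"load_store": "ld r1", "int_alu": "add r2"}),
    pack_word({"fp_add": "fadd f1"}),
]
for word in program:
    print(word)
print(f"slot utilization: {utilization(program):.0%}")
```

The padded slots still occupy bits in every instruction word, whereas a superscalar machine would simply issue the three executable instructions.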


Q.15. Compare superscalar and VLIW processors. (R.G.P.V., Dec. 2016)
Ans. The comparison between superscalar and VLIW processors is as follows —

S. No. | Superscalar Processor | VLIW Processor
(i) | It is a subclass of RISC processors in which multiple instructions can be executed simultaneously in each cycle. | A very long instruction word processor uses more functional units than a superscalar processor.
(ii) | The clock rate of superscalar processors can be higher than that of VLIW processors and the same as that of scalar RISC processors. | The clock rate of VLIW processors is usually lower because of the use of read-only memory (ROM) in microprogrammed control.
(iii) | The CPI of a superscalar processor is higher than that of VLIW processors. | The CPI of VLIW processors can be lowered with the use of extra functional units.


Q.16. Discuss the memory hierarchy technology in computer system.


Ans. The memory hierarchy of storage devices includes tape units, disk
devices, main memory, caches, and registers. To characterize the storage
organization and memory technology at each level i, the following five parameters
are taken into account — memory size ($s_i$), access time ($t_i$), cost per byte ($c_i$),
transfer bandwidth ($b_i$) and unit of transfer ($x_i$).

Memory size is the number of words or bytes in level i. The round-trip
time from the CPU to the ith level memory is the access time. The cost of the
ith level memory is estimated by the product $c_i s_i$ of its cost per byte and
its memory size. The rate at which information is transferred between adjacent
levels is the bandwidth. The unit of transfer specifies the grain size
for data transfer between levels i and i + 1.

Fig. 2.3 shows a memory hierarchy with increasing capacity and decreasing
cost per byte from low level to high level. At a lower level, memory devices
are smaller in size, faster to access but more costly per byte, having a higher
bandwidth and using a smaller unit of transfer as compared to the
higher level.
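The cost estimate described above (cost per byte times memory size at each level) can be summed over the hierarchy. The sketch below uses made-up sizes and costs purely for illustration; `total_cost` is a hypothetical helper name.

```python
def level_cost(size_bytes: float, cost_per_byte: float) -> float:
    """Cost of one level: c_i * s_i."""
    return cost_per_byte * size_bytes


def total_cost(levels: list) -> float:
    """Total cost of the hierarchy: sum of c_i * s_i over all levels."""
    return sum(level_cost(size, cost) for size, cost in levels)


# (size in bytes, cost per byte) -- hypothetical values, smallest and costliest first.
hierarchy = [
    (1_000, 0.5),          # fast, small, expensive per byte
    (100_000, 0.01),       # larger and cheaper per byte
    (10_000_000, 0.0001),  # largest and cheapest per byte
]

for size, cost in hierarchy:
    print(f"{size:>10} bytes at {cost} per byte -> {level_cost(size, cost):.2f}")
print(f"total: {total_cost(hierarchy):.2f}")
```

In this made-up example the small fast level costs as much as half of either larger level, which reflects the trade-off between speed and cost per byte described above.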

[Fig. 2.3 Memory Hierarchy: Level 1 to Level 4, including main memory, disk storage (magnetic, solid-state) and tape units (optical disks, magnetic tapes); capacity increases toward the higher levels]

Registers and Caches — The registers and the cache are parts of the
processor complex, built either on the processor board or on the processor
chip. Register assignment is done by the compiler. After decoding the
instructions, register transfer operations are directly managed by the processor.
Usually, a register transfer is conducted in one clock cycle at processor speed.
As a result, many designers do not consider registers a level of memory.
On the basis of speed and application needs, the cache is
implemented at one or more levels. The cache is transparent to the programmer
and is managed by the memory management unit.

Main Memory — The main memory is sometimes also known as the
primary memory. It is much larger than the cache and is
often implemented with the most cost-effective RAM chips. The main memory
is handled by a memory management unit in cooperation with the operating
system. The main memory can be extended by adding
more memory boards to a computer system. Sometimes, it is itself partitioned into
two sublevels employing different memory technologies.

Disk Drives and Tape Units — Disk drives and tape units are managed
by the operating system (OS) with limited user intervention. The highest level
of on-line memory is disk storage, which contains the system programs like
the compilers and the OS, and some user programs and their data sets. The magnetic
tape units are off-line memory and are used for backup storage. They hold
processed results and files, and copies of present and past user programs. A
workstation system contains the hard disks in an attached disk drive, and the
cache and main memory on a processor board. User intervention is required
to access the magnetic tape units.


Q.17. Distinguish between SRAM and DRAM. (R.GP.V., Dec. 2016) 


Ans. RAMs come in two varieties, static and dynamic. Static RAMs
(SRAMs) are constructed internally using circuits similar to the basic D flip-
flop. They have the property that their contents are retained as long as the
power is kept on. Static RAMs are very fast. Their typical access time is a few
nanoseconds (nsec). For this reason, static RAMs are popular as level
2 (L2) cache memory.

In contrast, dynamic RAMs (DRAMs) do not use flip-flops. A dynamic
RAM is an array of cells, each of which contains one transistor and a tiny
capacitor. The capacitors can be charged or discharged, thus allowing 0's and
1's to be stored. As the electric charge tends to leak out, each bit in a
dynamic RAM must be refreshed or reloaded every few milliseconds to prevent
the data from leaking away. Because external logic must take care of the
refreshing, dynamic RAMs need more complex interfacing than
static RAMs, although in many applications this disadvantage is compensated
by their larger capacities.

Since DRAMs need only one transistor and one capacitor per bit (in contrast
to six transistors per bit for the best SRAM), they have a very high density
(many bits per chip). For this reason, main memories are nearly always built
out of DRAMs. However, this large capacity has a price, i.e., DRAMs are slow
(tens of nanoseconds). So the combination of an SRAM cache and a DRAM
main memory attempts to combine the good properties of each.


Q.18. Discuss the inclusion property.


Ans. The inclusion property is described as

$$M_1 \subset M_2 \subset M_3 \subset \cdots \subset M_n$$

The set inclusion relationship specifies that all information items are originally
held in the outermost level $M_n$. Subsets of $M_n$ are copied into $M_{n-1}$ during the
processing. In a similar way, subsets of $M_{n-1}$ are copied into $M_{n-2}$, etc.

In an alternative way, we can say that if an information word is present in
$M_i$, then copies of the same word are also present in all upper levels $M_{i+1}$,
$M_{i+2}$, ..., $M_n$, but the converse is not true. A word miss in $M_i$ implies that
it is also missing from all lower levels $M_{i-1}$, $M_{i-2}$, ..., $M_1$. The highest level is
used for backup storage, where all information words are present.
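The inclusion property can be checked mechanically by treating each memory level as a set of word addresses. The sketch below is a minimal illustration; the level contents and the function name `satisfies_inclusion` are invented for the example.

```python
def satisfies_inclusion(levels: list) -> bool:
    """levels[0] is M_1 (innermost), levels[-1] is M_n (outermost).

    The property holds when every level is a subset of the next higher level.
    """
    return all(inner <= outer for inner, outer in zip(levels, levels[1:]))


m1 = {"a"}                  # cache
m2 = {"a", "b"}             # main memory
m3 = {"a", "b", "c", "d"}   # disk storage, which holds everything in this example

print(satisfies_inclusion([m1, m2, m3]))            # True
print(satisfies_inclusion([{"a", "z"}, m2, m3]))    # False: "z" is missing from M_2

# A word that misses in M_2 (here "c") must also miss in the lower level M_1.
assert "c" not in m2 and "c" not in m1
```

The subset test `<=` on Python sets allows equal levels; the property as stated above is normally read with each lower level holding only part of the level above it.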


[Fig. 2.4 The Inclusion Property and Transfer of Data between Adjacent Memory Hierarchy Levels: M1 (cache) is accessed by word (8 bytes) from a cache block of 32 bytes, such as blocks a and b; main memory is accessed by block (32 bytes) from a memory page of 32 blocks or 1 Kbyte; M3 (disk storage) is accessed by page (1 Kbyte) from a file consisting of many pages, such as page F and page G in segment I; M4 (magnetic tape unit, backup storage) exchanges segments with different numbers of pages, such as segment H]

Transfer of information between the CPU and cache is in terms of words.
The size of the word is 4 bytes or 8 bytes, based on the machine word length.
The cache ($M_1$) is partitioned into cache blocks, typically of 32 bytes, which are
also known as cache lines. As depicted in fig. 2.4, blocks (like a and b) are the
units of data transfer between the cache and main memory. The main memory
($M_2$) is partitioned into pages of 4 Kbytes each, and each page contains 128 blocks.
The unit of information transferred between disk and main memory is the page.
In the disk memory, scattered pages are organized as a segment. For example,
segment I has page F, page G, and other pages. The size of the segment varies
according to user needs. Fig. 2.4 shows that data transfer between the disk
and the tape unit is controlled at the file level, as with segments H and I.


Q.19. Describe memory coherence property in a multilevel memory
hierarchy.
Or
Explain memory coherence requirements in a multilevel memory
hierarchy. Distinguish between write-through and write-back policies in
maintaining the coherence in adjacent levels.
Or
What do you understand by coherence ? Explain briefly.
(R.G.P.V., June 2014)


Ans. The coherence property requires that copies of the same
information item be consistent at successive memory levels. If a word is
modified in the cache, copies of that word must be updated immediately or
eventually at all higher levels. The hierarchy should be maintained as such. To reduce the
effective access time of the memory hierarchy, frequently used information is
often found in the lower levels. In a memory hierarchy, there are two methods
for maintaining the coherence —


(i) Write-through (WT) Method — This method requires immediate
update in $M_{i+1}$ when a word is modified in $M_i$, for i = 1, 2, ..., n - 1.


(ii) Write-back (WB) Method — This method delays the update in
$M_{i+1}$ until the word being modified in $M_i$ is replaced or removed from $M_i$.


Also refer to Q.4, Unit-IV. 
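The two update policies can be contrasted with a toy model of two adjacent levels. This is a minimal sketch, not a cache design: the class name, the dictionaries standing for $M_i$ and $M_{i+1}$, and the explicit `evict` call are all assumptions made for illustration.

```python
class TwoLevelMemory:
    """Toy model of adjacent levels M_i and M_{i+1} under WT or WB."""

    def __init__(self, policy: str):
        if policy not in ("write-through", "write-back"):
            raise ValueError("policy must be 'write-through' or 'write-back'")
        self.policy = policy
        self.m_i = {}        # lower level, e.g. the cache: address -> value
        self.m_next = {}     # next higher level, e.g. main memory
        self.dirty = set()   # addresses modified in M_i but not yet in M_{i+1}

    def write(self, addr: int, value: int) -> None:
        self.m_i[addr] = value
        if self.policy == "write-through":
            self.m_next[addr] = value   # immediate update in M_{i+1}
        else:
            self.dirty.add(addr)        # update delayed until replacement

    def evict(self, addr: int) -> None:
        """Remove a word from M_i, writing it back first if it is dirty."""
        if addr in self.dirty:
            self.m_next[addr] = self.m_i[addr]
            self.dirty.discard(addr)
        self.m_i.pop(addr, None)


wt = TwoLevelMemory("write-through")
wt.write(0x10, 7)
print("WT, after write:", wt.m_next)

wb = TwoLevelMemory("write-back")
wb.write(0x10, 7)
print("WB, after write:", wb.m_next)   # still empty: the two levels differ for now
wb.evict(0x10)
print("WB, after evict:", wb.m_next)
```

Write-through keeps the two levels consistent at all times at the price of more traffic to $M_{i+1}$; write-back reduces that traffic but leaves the levels temporarily inconsistent.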


Q.20. Explain the inclusion property and memory coherence 
requirements in a multilevel memory hierarchy. (R.GP.V., Dec. 2010, 2016) 
Or 

Explain the inclusion property and memory coherence requirements in 

a multilevel memory hierarchy. Distinguish between write through and write 
back policies in maintaining coherence in adjacent levels. 

(R.G.P.V., June 2017)
Or
What are inclusion property and memory coherence requirements ?
Distinguish between write through and write back policies.
(R.G.P.V., May 2018)
Ans. Refer to Q.18 and Q.19.


Q.21. Distinguish between write through and write back policies in
maintaining the coherence in adjacent levels. Also explain the basic concepts
of paging and segmentation in managing the physical and virtual memory
in a hierarchy. (R.G.P.V., June 2010)


Ans. Difference between Write-through and Write-back Policies —
Refer to Q.19.

Paged Memory System (or Paging) — The technique for partitioning
both the physical memory and virtual memory into fixed-size pages is called
paging. Exchange of information between them is performed at the page
level. Page tables (PTs) are employed to map between pages and page frames.
Page tables are set up in the main memory upon creation of user
processes in application programs. The number of page tables maintained in
the main memory can be very large because many user processes may be created
dynamically. The page table entries (PTEs) are similar to the TLB entries,
containing the required (virtual page, page frame) address pairs. It should be noted
that both PTEs and TLB entries need to be dynamically updated to show the
latest memory reference history. Only snapshots of the history are maintained
in these translation maps.

A page fault is declared when the required entry is not found in the page
table. A page fault indicates that the referenced page is not present in the main
memory. The running process is suspended when a page fault takes place.
A context switch is performed to another ready-to-run process while the
missing page is transferred from the disk or tape unit to the physical memory. This
direct page mapping can be extended with multiple levels of page tables. Since
multiple memory references are required to access a sequence of page tables,
multilevel paging takes a longer time to create the required physical address.
The reason behind multilevel paging is to extend the memory space and to
offer more sophisticated protection of page access at various levels of the
memory hierarchy.
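A one-level page table lookup, including the page fault case, can be sketched as follows. The page size of 4 Kbytes, the dictionary used as a page table and the `PageFault` exception are assumptions chosen for illustration, not details of any particular system.

```python
PAGE_SIZE = 4096  # assumed page size of 4 Kbytes


class PageFault(Exception):
    """Raised when the referenced page has no entry in the page table."""


def translate(vaddr: int, page_table: dict) -> int:
    """Map a virtual address to a physical address through a one-level page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # virtual page number and offset
    if vpn not in page_table:
        raise PageFault(f"virtual page {vpn} is not in main memory")
    frame = page_table[vpn]                  # page frame holding the page
    return frame * PAGE_SIZE + offset


page_table = {0: 5, 1: 2}   # hypothetical (virtual page -> page frame) pairs

print(translate(4100, page_table))   # page 1, offset 4 -> frame 2
try:
    translate(7 * PAGE_SIZE, page_table)
except PageFault as fault:
    print("page fault:", fault)
```

In a real system the fault handler would bring the missing page in from disk and update the table before the suspended process resumes. A multilevel scheme repeats the table lookup once per level, which is why it needs more memory references.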


Segmented Memory System (or Segmentation) — Many pages can be
shared among multiple user programs at the same time by segmenting the
virtual address space. A segment of scattered pages is formed logically in the
virtual memory space. Segments are defined by users to declare a part of the
virtual address space.

User programs are logically structured as segments in a segmented memory
system. Segments can have variable lengths and can invoke each other. Because
of the nonuniform segment size, the management of a segmented memory
system is much more difficult.

In the virtual address space, segments provide logical structures of
data and programs. Segments are a user-oriented concept. On the contrary,
paging makes physical memory management easy. In a paged system, all
page addresses form a linear address space within the virtual space.

The segmented memory is organized as a two-dimensional address space.
In this space, each virtual address contains a prefix field known as the segment
number and a postfix field known as the offset within the segment. In each
segment, the offset addresses form one dimension of the contiguous addresses.
The segment numbers, not necessarily contiguous to each other, form the
second dimension of the address space.
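The two-dimensional address format can be sketched by splitting a virtual address into its segment number and offset. The 16-bit offset field and the segment table holding (base, length) pairs are assumptions introduced for this example; the text above specifies only the two fields.

```python
OFFSET_BITS = 16  # assumed width of the offset field


def split_address(vaddr: int) -> tuple:
    """Return (segment number, offset within the segment)."""
    return vaddr >> OFFSET_BITS, vaddr & ((1 << OFFSET_BITS) - 1)


def translate_segmented(vaddr: int, segment_table: dict) -> int:
    """Translate using a hypothetical table of segment -> (base, length)."""
    segment, offset = split_address(vaddr)
    base, length = segment_table[segment]
    if offset >= length:
        raise ValueError("offset lies outside the segment")
    return base + offset


segment_table = {3: (0x8000, 0x100)}  # segment 3: 256 bytes starting at 0x8000

print(split_address(0x0003_0010))                            # (3, 16)
print(hex(translate_segmented(0x0003_0010, segment_table)))  # 0x8010
```

The length check reflects the variable size of segments: unlike fixed-size pages, each segment needs its own bound.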
Q.22. Explain various address translation mechanism in a virtual
memory. (R.G.P.V., Dec. 2016)


Ans. Various virtual address translation mechanisms are shown in fig.
2.5. Virtual address translation requires translation maps, which are
stored in main memory, cache memory or associative memory. A mapping
function is required to generate a pointer to the correct translation map.
The mapping function can be implemented as a congruence function or a hashing
function. The congruence function provides hashing into a linked list, while the
hashing function transforms a long page number into a shorter one and gives a
unique hashed number which can be used as the pointer.


[Fig. 2.5 Virtual Address Translation Mechanism: a mapping function (congruence or hashing) converts the virtual address into a pointer to a translation map; the translation maps are direct mapping (one-level or multilevel page table), TLB (ATC) and inverted mapping (inverted or associative page table)]


(i) Direct Mapping — Direct mapping uses a one-level page table or a
multilevel page table. Both types of page tables can be implemented in paging,
segmentation and paged-segment mapping schemes.

In paging, the physical and virtual memory are divided into fixed-size
pages. Pages and page frames are mapped by using page tables. Multilevel
paging requires more time than one-level paging to generate the physical address
because multiple page tables must be accessed, which requires multiple memory
references. In segmentation, the user program is divided into segments of variable
length. The virtual address of a segmented memory is a combination of segment number
and offset. The features of paging and segmentation are combined in the paged-segment
system. Here the virtual address is divided into three fields: segment number,
page number and offset. It combines the advantages of paged memory and
segmented memory.

Refer to Q.21.

(ii) Inverted Mapping — Direct mapping cannot handle large
virtual address spaces efficiently, because a large virtual address space
requires either a multilevel page table or a very large page table,
which reduces the performance. To overcome this, inverted
mapping is used, which is implemented by either an inverted
page table or an associative page table. Inverted mapping is shown
in fig. 2.6.

Inverted page tables can be accessed by using a hashing function or an
associative search. In this technique, a long virtual address is generated from
a short address with the help of segment registers. The segment register provides
a segment ID that replaces the 4-bit sreg field to form the long virtual address.

(iii) TLB (Translation Lookaside Buffer) — In this technique TLB 
and page table works as translation map. TLB saves referenced page entries 
which are most commonly or most recently used. The address translation 
using TLBs and page tables is shown in fig. 2.7. Virtual address is partitioned 
into three parts — Virtual page number, Cache block number and word address. 


Fig. 2.6 Inverted Mapping

Fig. 2.7 Address Translation Using TLB and PTs

First, the TLB is searched for the virtual page number. If a match is found (hit), the physical page number is obtained from the matched TLB entry and the physical address is formed by combining the physical page number with the block and word fields of the virtual address. If no match is found (miss), a pointer identifies the required page in the page table, from which the required page frame can be obtained.
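
The hit and miss paths described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the page size, the LRU replacement policy, and the names `TLB` and `translate` are hypothetical and not taken from the text, and the block and word fields are merged into a single page offset.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes (not given in the text)


class TLB:
    """A tiny fully associative TLB with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> page frame number

    def lookup(self, vpn):
        if vpn not in self.entries:
            return None                # TLB miss
        self.entries.move_to_end(vpn)  # mark as most recently used
        return self.entries[vpn]

    def insert(self, vpn, frame):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        self.entries[vpn] = frame


def translate(vaddr, tlb, page_table):
    """Return (physical address, tlb_hit) for a virtual address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    frame = tlb.lookup(vpn)
    hit = frame is not None
    if not hit:
        frame = page_table[vpn]  # page-table access; a KeyError would model a page fault
        tlb.insert(vpn, frame)
    return frame * PAGE_SIZE + offset, hit
```

With a page table such as `{0: 5, 1: 2}`, the first translation of a given page goes through the page table and the second one is served from the TLB.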


Q.23. Explain the term locality of references with its properties.
Or
Explain the temporal locality, spatial locality and sequential locality associated with program/data access in a memory hierarchy.
(R.G.P.V., June 2010, 2014, May 2018)


Ans. The memory hierarchy was developed on the basis of a program behaviour known as locality of references. Memory references are generated by the CPU for either instruction or data access. These accesses tend to be clustered in certain regions in time, space and ordering. In other words, most programs favor a certain portion of their address space during any time window. In 1990, Patterson and Hennessy pointed out a 90-10 rule, which states that a typical program may spend 90% of its execution time on only 10% of the code, such as the innermost loop of a nested looping operation.

The three dimensions of the locality property are sequential, spatial and temporal. Many pages are used dynamically during the lifetime of a software process, and the references to these pages change from time to time, although they follow certain access patterns. These memory reference patterns occur due to the following three locality properties —

(i) Sequential Locality — In typical programs, the execution of instructions follows a sequential order unless branch instructions create out-of-order executions. In ordinary programs, the ratio of in-order to out-of-order execution is about 5 to 1. Besides, a large data array is also accessed in sequential order.


(ii) Spatial Locality — Spatial locality is the tendency of a process to access items whose addresses are near one another. For instance, operations on arrays or tables involve accesses to a certain clustered area in the address space. Program segments (like macros or routines) tend to be stored in the same neighbourhood of the memory space.


(iii) Temporal Locality — Recently referenced data or instructions are likely to be referenced again in the near future. This is often caused by special program constructs like subroutines, process stacks and iterative loops. Once a loop is entered or a subroutine is invoked, a small code segment will be referenced again and again. Thus temporal locality tends to cluster the accesses in the recently used areas.

Q.24. Explain locality of reference and memory hierarchy.
(R.G.P.V., June 2015)
Ans. Refer to Q.23 and Q.16.
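
As a small illustration of why locality makes a small, fast memory level effective, the sketch below compares the hit ratio of a tiny cache for a loop-like trace (strong temporal locality) and for a random trace. The LRU policy, the cache size of 8 and the trace lengths are assumptions chosen only for this example.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(trace, cache_size):
    """Fraction of references in `trace` that hit in an LRU cache."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # refresh recency on a hit
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used address
            cache[addr] = True
    return hits / len(trace)


# A loop that touches the same 8 addresses 100 times (temporal locality).
loop_trace = list(range(8)) * 100

# 800 references spread uniformly over 1000 addresses (little locality).
rng = random.Random(0)
random_trace = [rng.randrange(1000) for _ in range(800)]

print("loop trace  :", lru_hit_ratio(loop_trace, 8))
print("random trace:", lru_hit_ratio(random_trace, 8))
```

For the loop trace only the first pass misses, so the cache serves almost every later reference. The random trace rarely revisits a cached address, so the same cache helps very little.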


Q.25. Discuss the optimization of the capacity of a memory hierarchy subject to a cost constraint.

Ans. The total cost of a memory hierarchy is calculated as —

$$C_{total} = \sum_{i=1}^{n} c_i s_i$$

where $c_i$ is the cost per byte and $s_i$ is the capacity of memory level $M_i$. It indicates that the cost is distributed over n levels. Because $c_1 > c_2 > c_3 > \cdots > c_n$, we have to select $s_1 < s_2 < s_3 < \cdots < s_n$. The optimal design of a memory hierarchy should result in a $T_{eff}$ close to the $t_1$ of $M_1$ and a total cost close to the $c_n$ of $M_n$. In practice, this is difficult to achieve because of the tradeoffs among the n levels.

Given a ceiling $C_0$ on the total cost, the optimization process can be formulated as a linear programming problem, that is, a problem to minimize

$$T_{eff} = \sum_{i=1}^{n} f_i t_i$$

subject to the following constraints —

$$s_i > 0, \quad t_i > 0 \quad \text{for } i = 1, 2, \ldots, n$$

$$C_{total} = \sum_{i=1}^{n} c_i s_i < C_0$$

Q.26. Define the following terms —
(i) Hit ratios (ii) Access frequency (iii) Effective access time.

Ans. (i) Hit Ratios — The concept of a hit ratio is defined for any two adjacent levels of a memory hierarchy. When an information item is found in $M_i$, it is called a hit; otherwise, it is a miss. Consider memory levels $M_i$ and $M_{i-1}$ in a hierarchy, for i = 1, 2, ..., n. The hit ratio $h_i$ at $M_i$ is the probability that an information item will be found in $M_i$. It is a function of the characteristics of the two adjacent levels $M_{i-1}$ and $M_i$. The miss ratio at $M_i$ is defined as $1 - h_i$.

The hit ratios at successive levels are a function of memory capacities, management policies and program behaviour. Successive hit ratios are independent random variables with values between 0 and 1. To simplify the derivation that follows, it is assumed that $h_0 = 0$ and $h_n = 1$. This means that the CPU always accesses $M_1$ first and that the access to the outermost memory $M_n$ is always a hit.

(ii) Access Frequency — The access frequency to memory level $M_i$ is defined as

$$f_i = (1 - h_1)(1 - h_2) \cdots (1 - h_{i-1}) h_i$$

This is the probability of successfully accessing $M_i$ when there are i - 1 misses at the lower levels and a hit at $M_i$. It is noted that $\sum_{i=1}^{n} f_i = 1$ and $f_1 = h_1$.

The access frequencies decrease very rapidly from low to high levels because of the locality property. That is, $f_1 \gg f_2 \gg f_3 \gg \cdots \gg f_n$. This implies that the inner levels of memory are accessed more often than the outer levels.
(iii) Effective Access Time — In practice, it is desirable to achieve as high a hit ratio as possible at $M_1$. A penalty must be paid to access the next higher level of memory when a miss occurs. The misses are called block misses in the cache and page faults in the main memory because blocks and pages are the units of transfer between these levels of memory. Because $t_1 < t_2 < t_3$, the time penalty for a page fault is higher than that for a block miss.

The effective access time of a memory hierarchy, using the access frequencies $f_i$ for i = 1, 2, ..., n, is defined as —

$$T_{eff} = \sum_{i=1}^{n} f_i t_i = h_1 t_1 + (1 - h_1) h_2 t_2 + (1 - h_1)(1 - h_2) h_3 t_3 + \cdots + (1 - h_1)(1 - h_2) \cdots (1 - h_{n-1}) t_n$$

Q.27. Define the term hit ratio and miss ratio. (R.G.P.V., Dec. 2015)

Ans. Refer to Q.26 (i).

Q.28. Explain interleaved memory organization. Justify the use of interleaved memory organization. (R.G.P.V., June 2011)
Or
What is memory interleaving ? (R.G.P.V., June 2015)
Or
Explain the memory interleaving technique. (R.G.P.V., Dec. 2015)
Or
Describe interleaved memory organization briefly. (R.G.P.V., June 2012, 2013)
Or
Write short note on memory interleaving. (R.G.P.V., Dec. 2017)


Ans. The main memory is built with multiple modules. These memory modules are connected to a system bus or a switching network to which other resources such as processors or I/O devices are also connected. Once presented with a memory address, each memory module returns one word per cycle. It is also possible to present different addresses to different memory modules. Therefore, parallel access of multiple words can be performed simultaneously or in a pipelined manner.

Suppose a main memory is made of $m = 2^a$ memory modules, each having $w = 2^b$ words of memory cells. The total memory capacity is $m \cdot w = 2^{a+b}$ words. These memory words are assigned linear addresses. Different ways of assigning linear addresses result in different memory organizations.

Two interleaved memory organizations with $m = 2^a$ modules and $w = 2^b$ words per module (word addresses shown in boxes) are shown in fig. 2.8. Low-order interleaving, shown in fig. 2.8 (a), spreads contiguous memory locations across the m modules horizontally. That is, the low-order a bits of the memory address identify the memory module and the high-order b bits are the word address within each module. It is noted that the same word address is applied to all memory modules at a time. A module address decoder is used to distribute module addresses.

High-order interleaving employs the high-order a bits as the module address and the low-order b bits as the word address within each module, as shown in fig. 2.8 (b). Therefore, contiguous memory locations are allocated to the same memory module. Only one word is accessed from each module in each memory cycle. Therefore, high-order interleaving does not support block access of contiguous locations. In contrast, low-order m-way interleaving does support block access in a pipelined manner.

(a) Low-order m-way Interleaving (The C-access Memory Scheme)
(b) High-order m-way Interleaving

Fig. 2.8 Interleaved Memory Organizations


In computing, interleaved memory is a design made to compensate for 
the relatively slow speed of dynamic random-access memory (DRAM) or 
core memory, by spreading memory addresses evenly across memory banks. 
That way, contiguous memory reads and writes are using each memory bank 
in turn, resulting in higher memory throughputs due to reduced waiting for 
memory banks to become ready for desired operations. 


It is different from multi-channel memory architectures, primarily because interleaved memory does not add more channels between the main memory and the memory controller. However, channel interleaving is also possible, for example in Freescale i.MX6 processors, which allow interleaving to be done between two channels.


With interleaved memory, memory addresses are allocated to each memory bank in turn. For example, in an interleaved system with two memory banks (assuming word-addressable memory), if logical address 32 belongs to bank 0, then logical address 33 would belong to bank 1, logical address 34 would belong to bank 0, and so on. An interleaved memory is said to be n-way interleaved when there are n banks and memory location i resides in bank i mod n.
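
The two address splits described in this answer can be sketched as follows. This is a minimal Python illustration in which the function names are hypothetical, `a` is the number of module-address bits and `b` is the number of word-address bits.

```python
def low_order(address, a, b):
    """Low-order interleaving: module = low a bits, word = high b bits."""
    assert 0 <= address < (1 << (a + b)), "address out of range"
    module = address & ((1 << a) - 1)
    word = address >> a
    return module, word


def high_order(address, a, b):
    """High-order interleaving: module = high a bits, word = low b bits."""
    assert 0 <= address < (1 << (a + b)), "address out of range"
    module = address >> b
    word = address & ((1 << b) - 1)
    return module, word


# Eight modules of eight words each (a = b = 3).
for addr in range(4):
    print(addr, low_order(addr, 3, 3), high_order(addr, 3, 3))
```

With low-order interleaving, consecutive addresses fall in different modules (module number equals address mod m), which is what allows block access to be overlapped. With high-order interleaving, consecutive addresses stay in the same module.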


Q.29. Explain the following terms associated with cache and memory architectures —
(i) Low order memory interleaving
(ii) Physical address cache versus virtual address cache.
(R.G.P.V., June 2017)

Ans. (i) Low Order Memory Interleaving — Refer to Q.28.

(ii) Physical Address Cache versus Virtual Address Cache

Physical Address Cache | Virtual Address Cache
This is accessed with a physical memory address. | This is accessed with a virtual memory address.
After the address translation in MMU or TLB, the cache lookup occurs. | Address translation in MMU and cache lookup occur in parallel.
Cache is tagged and indexed with physical address. | Cache is indexed and tagged with virtual address.
Since data have unique index/tag in cache, there is no aliasing problem. | Different logically addressed data may have the same index/tag in cache, so there is an aliasing problem.
Since there is no aliasing problem, there is no need to perform cache flushing. | Cache flushing is required periodically to overcome the aliasing problem.


Q.30. How does interleaved memory organization provide pipelined access of the parallel memory modules ?

Ans. Accesses to the m memory modules can be overlapped in a pipelined manner. For this purpose, the memory cycle (called the major cycle) is partitioned into m minor cycles. Fig. 2.9 (a) illustrates an eight-way interleaved memory with m = 8 and w = 8, and hence a = b = 3. Let $\theta$ be the major cycle and $\tau$ the minor cycle. These two cycle times are related as —

$$\tau = \frac{\theta}{m}$$

Here, m denotes the degree of interleaving. Fig. 2.9 (b) illustrates the timing of the pipelined access of eight contiguous memory words in a C-access memory.

The C-access memory scheme is a concurrent access of contiguous words. The total time needed to complete the access of a single word from a module is the major cycle $\theta$. The actual time required to produce one word is the minor cycle $\tau$, assuming overlapped access of successive memory modules separated by one minor cycle $\tau$.

It is noted that the pipelined access of the block of eight consecutive words is sandwiched between other pipelined block accesses before and after the present block. Even though the total block access time is $2\theta$, the effective access time of each word is decreased to $\tau$ because the memory is contiguously accessed in a pipelined manner.

(a) Eight-way Low-order Interleaving (Absolute Addresses Shown in Boxes)
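
The timing relation can be checked with a small Python sketch. It assumes that module i starts its access i minor cycles after module 0, that each access takes one major cycle, and that the last word then occupies the data path for one more minor cycle. The function name and this exact timing model are assumptions for illustration, not a description of any particular hardware.

```python
def c_access_schedule(m, theta):
    """Return (tau, per-module schedule, block completion time).

    Each schedule entry is (module, start_time, word_ready_time).
    """
    tau = theta / m                  # minor cycle = major cycle / degree of interleaving
    schedule = []
    for module in range(m):
        start = module * tau         # accesses are staggered by one minor cycle
        ready = start + theta        # each module needs a full major cycle
        schedule.append((module, start, ready))
    block_done = schedule[-1][2] + tau  # last word delivered one minor cycle later
    return tau, schedule, block_done


tau, schedule, block_done = c_access_schedule(m=8, theta=8.0)
print("minor cycle:", tau)
for module, start, ready in schedule:
    print(f"module {module}: start {start}, word ready {ready}")
print("block of 8 words done at:", block_done)
```

Under these assumptions the first word appears after one major cycle, one further word appears every minor cycle, and the whole block of m words completes at twice the major cycle.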


Q.31. What do you understand by memory bandwidth ?

Ans. The memory bandwidth (B) of an m-way interleaved memory is lower-bounded by 1 and upper-bounded by m. The approximation of B by Hellerman is

$$B = m^{0.56} \approx \sqrt{m} \qquad \text{...(i)}$$

In equation (i), m denotes the number of interleaved memory modules. This equation indicates that the effective memory bandwidth is approximately twice that of a single module when four memory modules are used.


(b) Pipelined Access of Eight Consecutive Words 


Fig. 2.9 Multiway Interleaved Memory 
Organization and the C-access Timing Chart 


This pessimistic estimate is due to the fact that block accesses of different lengths and accesses of single words are randomly mixed in user programs. Hellerman's estimate was based on a single-processor system. The effective memory bandwidth is reduced further if memory-access conflicts from multiple processors are considered.
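
Hellerman's estimate is easy to tabulate. The Python sketch below evaluates equation (i) for a few module counts. The function name is hypothetical and the values are only the estimate itself, not measurements.

```python
def hellerman_bandwidth(m):
    """Estimated bandwidth (words per memory cycle) of an m-way interleaved memory."""
    return m ** 0.56


for m in (1, 4, 16, 64):
    print(f"m = {m:2d}: B is about {hellerman_bandwidth(m):.2f} "
          f"(bounds: 1 to {m})")
```

For m = 4 the estimate is a little above 2, which matches the statement that four modules give roughly twice the bandwidth of a single module.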


Q.32. Define the terms — Access time, bandwidth. (R.G.P.V., June 2016)
Ans. Refer to Q.26 (iii) and Q.31.

Q.33. Discuss the term fault tolerance in memory interleaving.
Or
What is fault tolerance ? (R.G.P.V., Dec. 2015)

Ans. Low-order and high-order interleaving can be combined to obtain various interleaved memory organizations. In high-order interleaved memory, sequential addresses are allocated within each memory module. This makes it simple to isolate faulty memory modules in a memory bank of m memory modules. If one module failure is detected, the remaining modules can still be used by opening a window in the address space. This fault isolation cannot be performed in a low-order interleaved memory, where a module failure may paralyze the complete memory bank. Hence, low-order interleaved memory is not fault-tolerant.

Q.34. Explain backplane bus system briefly. (R.G.P.V., June 2010, Dec. 2015)
Or
Describe backplane bus system. (R.G.P.V., June 2012)
Or
Draw and explain block diagram of backplane bus system. Also describe bus arbitration and control. (R.G.P.V., May 2018)

Ans. In a tightly coupled hardware configuration, a backplane bus interconnects processors, peripheral devices and data storage. The system bus must be designed so that it permits communication between devices on the bus without disturbing the internal activities of the other devices connected to the bus. Operational rules must be set to guarantee orderly data transfers on the bus. Timing protocols must be employed to arbitrate among multiple requests.
Fig. 2.10 Backplane Bus System (blocks shown: bus controller board, CPU board with processor and cache, memory board, other boards, functional modules, interface logic, data transfer bus (DTB) with data, address and control lines, and the backplane signal lines and connectors)

Fig. 2.10 shows the backplane buses, system interfaces and slot connections to various functional boards in a multiprocessor system. Signal lines on the backplane are often functionally grouped into various buses. The four groups depicted here are very similar to those proposed in the 64-bit VME bus specification (VITA, 1990). On the backplane, different functional boards are plugged into slots. Each slot has one or more connectors for inserting the boards, as illustrated by the vertical arrows in fig. 2.10.

Data Transfer Bus — In a VME bus, data, control and address lines form the data transfer bus (DTB). The number of data lines is often proportional to the memory word length. The address lines are employed to broadcast the data and device address. The number of address lines is proportional to the logarithm of the size of the address space. Address modifier lines are used to define special addressing modes.

Bus Arbitration and Control — Arbitration is the process of allocating control of the data transfer bus to a requester. Dedicated lines are reserved to coordinate the arbitration process among various requesters. The requester is known as a master and the receiving end is known as a slave. Interrupt lines are used for handling interrupts. Dedicated lines are used to synchronize parallel activities among the processor modules. Utility lines comprise signals which provide periodic timing and coordinate the power-up and power-down sequences of the system. The backplane is composed of signal lines, power lines and connectors. A special bus controller board is used to house the backplane control logic, such as the arbiter, power driver, bus timer and system clock driver.

Functional Modules — Fig. 2.10 illustrates that the functional module is 
a group of electronic circuitry which resides on one functional board and 
works to accomplish special bus control functions. Special functional modules 
are — 

(i) Arbiter — It is a functional module which accepts bus requests from the requester modules and gives control of the data transfer bus to one requester at a time.

(ii) Interrupter — This module creates an interrupt request and gives status/ID information when an interrupt handler module requests it.

(iii) Bus Timer — It measures the time taken by each data transfer on the data transfer bus and terminates the data transfer bus cycle when a transfer takes too long.

(iv) Location Monitor — This functional module monitors data transfers over the data transfer bus. A power monitor watches the status of the power source and signals when power becomes unstable.

(v) System Clock Driver — This module offers a clock timing signal on the utility bus. Besides, board interface logic is required to match the signal line impedance, the propagation time and termination values between the plug-in boards and the backplane.

Physical Limitations — Only a limited number of boards can be plugged into a single backplane because of electrical, mechanical and packaging limitations. Multiple backplane buses can be mounted on the same backplane chassis.


Q.35. Describe the addressing and timing protocols of backplane bus system. (R.G.P.V., Dec. 2010)
Or
Explain addressing and timing protocols briefly.
(R.G.P.V., June 2012, 2014, 2015)

Ans. There are two kinds of printed-circuit boards attached to a bus: active and passive. Active boards such as processors can work as bus masters or as slaves at different times. Passive boards such as memory boards can work only as slaves. The master can begin a bus cycle and the slaves reply to requests by a master. Only one master can control the bus at a time. However, one or more slaves can reply to the master's request simultaneously.
Bus Addressing — The backplane bus is driven by a digital clock with a fixed cycle time, called the bus cycle. The electrical, mechanical and packaging properties of the backplane determine the bus cycle. Signals traveling on the bus lines may experience unequal delays from the source to the destination. The backplane is designed with a limited physical size so that information is not skewed with respect to the associated strobe signals. Cycles on parallel lines in various buses may overlap in time to speed up the operations. Factors responsible for the bus delay are the source's line drivers, the destination's receivers, slot capacitance, line length and the bus loading effects.

Not all bus cycles are used for data transfers. To optimize performance, the bus should be designed to reduce the time needed for request handling, arbitration, addressing and interrupts, so that most bus cycles are employed for useful data transfer operations.

Boards are identified using a slot number. A board is chosen as a slave when its slot number matches the contents of the high-order address lines. This addressing permits the allocation of a logical board address under software control, which enhances the application flexibility.

Broadcall and Broadcast — Only one master and one slave are involved in most bus transactions. However, a broadcall is a read operation which involves a number of slaves placing their data on the bus lines. Special AND or OR operations over these data are carried out on the bus from the selected slaves. Multiple interrupt sources are detected using broadcall operations. A broadcast is a write operation involving multiple slaves, which is needed in implementing multicache coherence on the bus. A typical timing sequence, when information is sent over a bus from a source to a destination, is depicted in fig. 2.11. Most bus timing protocols implement this sequence. Timing protocols are required to synchronize master (source) and slave (destination) operations.


1. Send request to bus.
2. Bus assignment.
3. Load address/data on bus.
4. Choose slave after signal stabilization.
5. Signal data transfer.
6. Take stabilized data.
7. Acknowledge data taken.
8. Knowing data taken, eliminate data and leave the bus.
9. Knowing data eliminated.
10. Signal transfer finished and leave the bus.
11. Send next bus request.

Fig. 2.11 Timing Sequence


Synchronous Timing — Fig. 2.12 (a) illustrates that all bus transaction steps occur at fixed clock edges. The clock signals are transmitted to all potential masters and slaves. The slowest device connected to the bus determines the clock cycle time. The master employs a data-ready pulse to initiate the transfer after the data becomes stabilized on the data lines. The slave uses a data-accept pulse to signal completion of the information transfer. A synchronous bus is appropriate for connecting devices having roughly the same speed. Otherwise, the slowest device will slow down the complete bus operation. A synchronous bus is easy to control, needs less control circuitry and is hence less expensive.


Asynchronous Timing — Fig. 2.12 (b) illustrates that asynchronous timing is based on a handshaking or interlocking mechanism. In asynchronous bus timing, no fixed clock cycle is required. The rising edge (1) of the data-ready signal from the master triggers the rising edge (2) of the data-accept signal from the slave. The second signal triggers the falling edge (3) of the data-ready signal and the removal of data from the bus. The third signal triggers the trailing edge (4) of the data-accept signal. This four-edge handshaking (interlocking) process is repeated until all the data are transferred.

(a) Synchronous Bus Timing with Fixed-length Clock Signals for all Devices
(b) Asynchronous Bus Timing with Variable-length Signals for Different-speed Devices

Fig. 2.12 Synchronous and Asynchronous Bus Timing Protocols

The benefit of employing an asynchronous bus is the freedom of using variable-length clock signals for devices of different speeds. This does not introduce any response-time limitations on the source and destination. It enables both slow and fast devices to be attached on the same bus. It is less prone to noise. At the expense of increased cost and complexity, an asynchronous bus provides better application flexibility.

Q.36. What is arbitration ? Explain central arbitration scheme. Also, explain arbitration scheme using independent requests and grants, distributed arbitration.

Ans. Arbitration — Arbitration is the process of selecting the next bus master. The duration of a master's control of the bus is known as bus tenure. The arbitration process restricts the tenure of the bus to one master at a time. The competing requests are arbitrated on a fairness or priority basis. Bus transactions and arbitration competition may occur concurrently on a parallel bus with separate lines for both purposes.

Central Arbitration — A central arbitration scheme using a central arbiter is shown in fig. 2.13 (a). Potential masters are daisy-chained in a cascade. A special signal line is used to propagate a bus-grant signal level from the first master (i.e., at slot 1) to the last master (i.e., at slot n). A bus request can be sent by each potential master. However, all requests share the same bus-request line. The bus request triggers the rise of the bus-grant level as shown in fig. 2.13 (b), which in turn raises the bus-busy level. In a daisy chain, a fixed priority is set from left to right. A device can be granted bus tenure only when the devices on its left do not request bus control. The bus-busy level is lowered when the bus transaction is complete, which triggers the falling of the bus-grant signal and the subsequent rising of the bus-request signal.

(a) Daisy-chained Bus Arbitration
(b) Bus Transaction Timing

Fig. 2.13 Central Bus Arbitration

The benefit of this arbitration scheme is its simplicity. Additional devices can be added anywhere in the daisy chain by sharing the same set of arbitration lines. The drawback is its slowness in propagating the bus-grant signal along the daisy chain. Also, it is a fixed-priority sequence, which violates the fairness practice. When a higher-priority device fails, the lower-priority devices on the right of the daisy chain cannot use the bus. Bypassing a removed device or a failing device on the daisy chain is preferred. Some new bus standards are specified with such a capability.

Independent Requests and Grants — As shown in fig. 2.14 (a), multiple bus-request and bus-grant signal lines can be independently provided for each potential master. In this scheme, no daisy-chaining is used. The arbitration among potential masters is still performed by a central arbiter. However, any priority-based or fairness-based bus allocation policy can be implemented. A multiprocessor system mostly uses a fairness-based policy among the processors and a priority-based policy for I/O transactions.

In some asymmetric multiprocessor architectures, the processors may be assigned different functions, such as serving as a front-end host, an executive processor or a back-end slave processor. In these cases, a priority policy can also be used among the processors.

Flexibility and faster arbitration time are the benefits of using independent requests and grants in bus arbitration as compared to the daisy-chained policy. However, the large number of arbitration lines required is the main limitation.
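
The difference between the two central schemes can be sketched in Python. The daisy-chain function models the fixed left-to-right priority described above. The round-robin function is only one possible fairness-based policy that a central arbiter with independent request and grant lines could apply; it is an assumption for illustration, and both function names are hypothetical.

```python
def daisy_chain_grant(requests):
    """Fixed priority: the leftmost requesting slot absorbs the bus grant.

    `requests` is a list of booleans, index 0 being the leftmost slot.
    """
    for slot, requesting in enumerate(requests):
        if requesting:
            return slot  # the grant is not propagated further to the right
    return None          # nobody is requesting the bus


def round_robin_grant(requests, last_granted):
    """Fairness-based policy: search starts just after the previous winner."""
    n = len(requests)
    for step in range(1, n + 1):
        slot = (last_granted + step) % n
        if requests[slot]:
            return slot
    return None


requests = [False, True, True]
print(daisy_chain_grant(requests))     # slot 1 always wins over slot 2
print(round_robin_grant(requests, 1))  # after slot 1 was served, slot 2 wins
```

In the daisy chain, slot 2 can be starved for as long as slot 1 keeps requesting, whereas the round-robin policy lets every requester have a turn.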


(b) Using Distributed Arbiters

Fig. 2.14 Two Bus Arbitration Schemes

Distributed Arbitration — Fig. 2.14 (b) shows the idea of using distributed arbiters. Each potential master has its own arbiter and a unique arbitration number. This unique arbitration number is employed to resolve the arbitration competition. When more than one device competes for the bus, the winner is the one whose arbitration number is the greatest.


To determine which device has the largest arbitration number, parallel contention arbitration is used. All potential masters can send their arbitration numbers to the shared-bus request/grant (SBRG) lines on the arbitration bus through their respective arbiters. Each arbiter compares the resulting number on the SBRG lines with its own arbitration number. The requester is dismissed when the SBRG number is higher. At the end, the winner's arbitration number remains on the arbitration bus. The winner seizes control of the bus when the current bus transaction is completed.

Obviously, the distributed arbitration policy is priority-based. Such a distributed arbitration scheme has been adopted by the Multibus II and the proposed Futurebus+. In addition to distributed arbiters, the Futurebus+ standard also offers options for a separate central arbiter.

Q.37. Write short note on transaction modes and interrupt mechanism of a bus.

Ans. Transaction Modes — A packet data transfer consists of an address transfer followed by a fixed-length block of data transfers (packet) from a set of contiguous addresses. A compelled data transfer consists of an address transfer followed by a block of one or more data transfers to one or more contiguous addresses. An address-only transfer consists of an address transfer followed by no data.

The two classes of operations regularly performed on a bus are data transfers and priority interrupt handling. A bus transaction consists of a request followed by a response. A connected transaction is used to perform a master's request and a slave's response in a single bus transaction. A split transaction splits the request and response into separate bus transactions. Split transactions permit devices with a long data latency or access time to use the bus resources in a more efficient way. A complete split transaction may need two or more connected bus transactions. In a large multiprocessor system, split transactions across multiple bus sequences are carried out to accomplish cache coherence.


Interrupt Mechanisms — A request from input/output or other peripheral devices to a processor for service or attention is known as an interrupt. A priority interrupt bus is employed to pass the interrupt signals. A functional module may be used to serve as an interrupt handler. Priority interrupts are handled at various levels. The interrupter must give identification detail and status. For example, the VME bus utilizes seven interrupt-request lines. Up to seven interrupt handlers can be used to handle multiple interrupts.

Interrupts can also be handled by message passing, using the data bus lines on a time-sharing basis. The saving of dedicated interrupt lines is obtained at the expense of requiring some bus cycles for handling message-based interrupts. The use of time-shared data bus lines to implement interrupts is known as virtual interrupts. The Futurebus+ was proposed without dedicated interrupt lines because virtual interrupts can be effectively implemented with the data transfer bus.


Q.38. Discuss arbitration, transaction and interrupt w.r. to backplane bus system. (R.G.P.V., June 2011)
Or
Explain about arbitration, transaction and interrupt. (R.G.P.V., Dec. 2014)

Ans. Refer to Q.36 and Q.37.


3b 3 


LINEAR PIPELINE PROCESSOR, NON-LINEAR PIPELINE PROCESSOR

Q.1. What do you understand by pipelining ? (R.G.P.V., Dec. 2009)


Ans. Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with the other segments. A pipeline can be visualized as a collection of processing segments through which binary information flows. Each segment performs partial processing dictated by the way the task is partitioned. The result obtained from the computation in each segment is transferred to the next segment in the pipeline. The final result is obtained after the data have passed through all segments. The name "pipeline" implies a flow of information analogous to an industrial assembly line. It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time. The overlapping of computation is made possible by associating a register with each segment in the pipeline. The registers provide isolation between the segments so that each can operate on distinct data simultaneously.


Q.2. What is linear pipeline processor ? Discuss its different models.


Ans. A cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to the other is called a linear pipeline processor.


Asynchronous and Synchronous Models — A linear pipeline processor is built with k processing stages. External inputs (operands) are fed into the pipeline at the first stage $S_1$. The processed results are passed from stage $S_i$ to stage $S_{i+1}$, where i = 1, 2, ..., k − 1. The final result comes from the pipeline at the last stage $S_k$. There are two categories of linear pipeline models on the basis of the control of data flow along the pipeline —

(i) Asynchronous Model — Fig. 3.1 (a) shows that a handshaking protocol is used to control data flow between adjacent stages in an asynchronous pipeline. When stage $S_i$ is ready to transmit, it sends a ready signal to stage $S_{i+1}$. After stage $S_{i+1}$ receives the incoming data, it returns an acknowledge signal to $S_i$.

Asynchronous pipelines are important in designing communication channels in message-passing multicomputers where pipelined wormhole routing is used. Asynchronous pipelines may have a variable throughput rate, and different stages may have different amounts of delay.


(a) An Asynchronous Pipeline Model (ready and acknowledge signals between adjacent stages)
(b) A Synchronous Pipeline Model (clocked latches between stages)
(c) Reservation Table (stages against time in clock cycles)

Fig. 3.1 Linear Pipeline Units Models and the Corresponding Reservation Table


(ii) Synchronous Model — Fig. 3.1 (b) shows a synchronous pipeline. Clocked latches are used to interface between stages. The latches are composed of master-slave flip-flops, which can isolate inputs from outputs. All latches transfer data to the next stage simultaneously when a clock pulse arrives. The pipeline stages are combinational logic circuits. It is preferred to have approximately equal delays in all stages. These delays determine the clock period and thus the speed of the pipeline.

A reservation table is used to specify the utilization pattern of successive stages in a synchronous pipeline. For a linear pipeline, the utilization follows the diagonal streamline pattern shown in fig. 3.1 (c). This table is just a space-time diagram depicting the precedence relationship in using the pipeline stages. For a k-stage linear pipeline, k clock cycles are required to flow through the pipeline. Successive tasks or operations are initiated one per cycle to enter the pipeline. Once the pipeline is filled up, one result comes from the pipeline for each additional cycle. This throughput is sustained only if the successive tasks are independent of each other.
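The diagonal pattern of a linear reservation table can be illustrated with a short Python sketch. It assumes n independent tasks entering a k-stage pipeline one per cycle; the function names `space_time_diagram` and `render` are illustrative and do not come from the text.

```python
def space_time_diagram(k, n):
    """Return {cycle: [(stage, task), ...]} for n independent tasks
    flowing through a k-stage linear pipeline, one initiation per cycle."""
    diagram = {}
    for task in range(1, n + 1):        # task i enters stage S1 at cycle i
        for stage in range(1, k + 1):   # and occupies stage Sj at cycle i + j - 1
            diagram.setdefault(task + stage - 1, []).append((stage, task))
    return diagram


def render(k, n):
    """Draw the diagram as text: one row per stage, one column per clock cycle."""
    diagram = space_time_diagram(k, n)
    last_cycle = max(diagram)
    rows = []
    for stage in range(1, k + 1):
        cells = []
        for cycle in range(1, last_cycle + 1):
            task = next((t for s, t in diagram.get(cycle, []) if s == stage), None)
            cells.append(f"T{task}" if task else " .")
        rows.append(f"S{stage}: " + " ".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    # Each task traces a diagonal; the last task leaves at cycle k + (n - 1).
    print(render(k=4, n=3))
```

Each task occupies a diagonal of the diagram, and no stage ever holds two tasks in the same cycle, which is why a latency of one cycle between initiations is safe for a linear pipeline.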


Q.3. Describe the major characteristics of linear pipeline processor.


Ans. Following are the major characteristics of a linear pipeline —

(i) Clock Period — The logic circuitry in each stage $S_i$ has a time delay represented by $\tau_i$. Let $d$ be the time delay of each interface latch. The clock period $\tau$ of a linear pipeline is given as follows —

$$\tau = \max_i \{\tau_i\} + d = \tau_m + d$$

where $\tau_m$ denotes the maximum stage delay.

The reciprocal of the clock period is known as the frequency $f = 1/\tau$ of the pipeline processor.

The space-time diagram can be drawn to show the overlapped operations in a linear pipeline processor. Ideally, n tasks can be processed by a linear pipeline with k stages in $T_k = k + (n - 1)$ clock periods, where k cycles are used to fill up the pipeline or to complete execution of the first task, and n − 1 cycles are required to complete the remaining n − 1 tasks.


(ii) Clock Skewing — Ideally, the clock pulses are expected to arrive at all stages (latches) at the same time. However, the same clock pulse may arrive at different stages with a time offset of $s$ because of a problem known as clock skewing. Let $t_{max}$ be the time delay of the longest logic path within a stage and $t_{min}$ be that of the shortest logic path within a stage.

To avoid a race in two successive stages, we must select $\tau_m \ge t_{max} + s$ and $d \le t_{min} - s$. These constraints translate into the following bounds on the clock period when clock skew takes effect —

$$d + t_{max} + s \le \tau \le \tau_m + t_{min} - s$$

In the ideal case $s = 0$, $t_{max} = \tau_m$ and $t_{min} = d$. Thus, we have

$$\tau = \tau_m + d$$


(iii) Speedup Factor — The speedup of a k-stage linear pipeline processor over an equivalent non-pipelined processor is defined as —

$$S_k = \frac{T_1}{T_k} = \frac{n k \tau}{k\tau + (n - 1)\tau} = \frac{n k}{k + (n - 1)}$$

The maximum speedup is $S_k \to k$ for $n \gg k$. It means that the maximum speedup that a linear pipeline can offer is k, where k denotes the number of stages in the pipe.


(iv) Efficiency — The efficiency of a linear pipeline is measured by the percentage of busy time-space spans over the total time-space span, which equals the sum of all busy and idle time-space spans. Let k, n and $\tau$ be the number of pipeline stages, the number of tasks and the clock period of a linear pipeline, respectively. The pipeline efficiency is given by

$$\eta = \frac{n k \tau}{k[k\tau + (n - 1)\tau]} = \frac{n}{k + (n - 1)}$$

If $n \to \infty$ then $\eta \to 1$, i.e., the larger the number of tasks flowing through the pipeline, the better its efficiency.

(v) Throughput — It is the number of tasks that can be completed by a pipeline per unit time. The throughput is defined as —

$$w = \frac{n}{k\tau + (n - 1)\tau} = \frac{\eta}{\tau}$$

where n denotes the total number of tasks being processed during an observation period $k\tau + (n - 1)\tau$. Ideally, $w = 1/\tau = f$ when $\eta \to 1$.
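The five measures above can be collected in a few small Python functions. This is only a sketch of the formulas as stated; the function names are illustrative, and delays may be given in any consistent time unit.

```python
def clock_period(stage_delays, latch_delay):
    """tau = max stage delay + latch delay d."""
    return max(stage_delays) + latch_delay


def total_time(k, n, tau):
    """T_k = [k + (n - 1)] * tau for n tasks on a k-stage pipeline."""
    return (k + (n - 1)) * tau


def speedup(k, n):
    """S_k = n*k / [k + (n - 1)]; approaches k when n >> k."""
    return (n * k) / (k + (n - 1))


def efficiency(k, n):
    """eta = n / [k + (n - 1)]; equals S_k / k."""
    return n / (k + (n - 1))


def throughput(k, n, tau):
    """w = n / [k*tau + (n - 1)*tau] = eta / tau, in tasks per unit time."""
    return n / total_time(k, n, tau)


if __name__ == "__main__":
    k, n, tau = 4, 64, 10e-9  # example values chosen only for illustration
    print(f"speedup    = {speedup(k, n):.3f}")
    print(f"efficiency = {efficiency(k, n):.3f}")
    print(f"throughput = {throughput(k, n, tau):.3e} tasks/s")
```

As n grows with k fixed, `speedup(k, n)` tends to k and `efficiency(k, n)` tends to 1, in line with the limits quoted in items (iii) and (iv).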


Q.4. What is the optimal number of pipeline stages for a linear pipeline processor ?


Ans. Let t be the total time needed for a nonpipelined sequential program of a given function. To execute the same program on a k-stage pipeline with an equal flow-through delay t, one requires a clock period of p = t/k + d, where d denotes the latch delay. Therefore, the pipeline has a maximum throughput of f = 1/p = 1/(t/k + d). The total cost of the pipeline is roughly estimated by c + kh, where c covers the cost of all logic stages and h denotes the cost of each latch. A pipeline performance/cost ratio (PCR) is defined as follows —

$$PCR = \frac{f}{c + kh} = \frac{1}{(t/k + d)(c + kh)}$$

Fig. 3.2 Optimal Number of Pipeline Stages for a Linear Pipeline Unit (performance/cost ratio against number of stages, with the peak at $k_0$)

The PCR as a function of k is plotted in fig. 3.2. The peak of the PCR curve corresponds to an optimal choice for the number of desired pipeline stages —

$$k_0 = \sqrt{\frac{t \cdot c}{d \cdot h}}$$

where t denotes the total flow-through delay of the pipeline. To get the optimal value $k_0$, the total stage cost c, the latch delay d, and the latch cost h can be adjusted.
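The position of the peak follows from minimizing the denominator of the PCR, $(t/k + d)(c + kh) = tc/k + th + dc + dkh$: setting its derivative with respect to k to zero gives $-tc/k^2 + dh = 0$, i.e. $k_0 = \sqrt{tc/(dh)}$. The Python sketch below checks this numerically; the parameter values in the example are arbitrary assumptions chosen for illustration and do not come from the text.

```python
import math


def pcr(k, t, d, c, h):
    """Performance/cost ratio 1 / [(t/k + d)(c + k*h)] of a k-stage pipeline."""
    return 1.0 / ((t / k + d) * (c + k * h))


def optimal_stages(t, d, c, h):
    """Closed-form peak of the PCR curve: k0 = sqrt(t*c / (d*h))."""
    return math.sqrt((t * c) / (d * h))


def best_integer_stages(t, d, c, h, k_max=64):
    """Brute-force search over whole stage counts 1..k_max."""
    return max(range(1, k_max + 1), key=lambda k: pcr(k, t, d, c, h))


if __name__ == "__main__":
    # Illustrative values only: flow-through delay, latch delay, logic cost, latch cost.
    t, d, c, h = 128.0, 2.0, 32.0, 8.0
    print(optimal_stages(t, d, c, h), best_integer_stages(t, d, c, h))
```

Since a real pipeline needs a whole number of stages, the closed-form value is normally rounded to a neighbouring integer and the two candidates compared, which is what the brute-force helper does.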


Q.5. Prove that k-stage linear pipeline can be at most k-times faster than that of non-pipelined serial processor. (R.G.P.V., Dec. 2004, June 2007, Dec. 2008, May 2018)

Or
Explain that the maximum speed-up of a pipeline is equal to its stages. (R.G.P.V., Dec. 2016)


Ans. Consider the execution of m tasks using a k-stage pipeline. In this situation the first task will be finished after k clocks (as there are k stages), and the remaining m − 1 tasks are shipped out at the rate of one task per pipeline clock. Thus, k + (m − 1) clock periods are needed to complete m tasks using a k-stage pipeline. If all m tasks are run without any overlap, mk clock periods are required, as each task has to pass through all k stages. Therefore, the speedup gained by a k-stage pipeline is given as follows —

$$P(k) = \frac{\text{Number of clocks needed when tasks are not overlapped}}{\text{Number of clocks needed when tasks are overlapped in time}} = \frac{km}{k + m - 1}$$

P(k) approaches k when m approaches infinity. This specifies that when a large number of tasks are performed using a k-stage pipeline, a k-fold enhancement in speed can be expected.
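The limiting step can be written out explicitly. Dividing the numerator and denominator of $P(k)$ by m gives the following short derivation, which also shows that the bound k is approached from below for every finite m (assuming k ≥ 1).

```latex
P(k) = \frac{km}{k + m - 1} = \frac{k}{1 + \dfrac{k - 1}{m}}
\qquad\Longrightarrow\qquad
P(k) \le k \ \text{ for all } m \ge 1,
\qquad
\lim_{m \to \infty} P(k) = k .
```

Because $(k - 1)/m \ge 0$, the denominator is never less than 1, so the speedup can never exceed the number of stages.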


Q.6. Discuss non-linear pipeline processors.


Ans. A dynamic pipeline can be reconfigured to perform variable functions at different times. The traditional linear pipelines are static pipelines as they are used to perform fixed functions. A dynamic pipeline permits feedforward and feedback connections apart from the streamline connections. That is why such a structure is known as a non-linear pipeline.

In a static pipeline, it is simple to partition a given function into a sequence of linearly ordered subfunctions. However, function splitting in a dynamic pipeline is more involved, since the pipeline stages are interconnected with loops apart from streamline connections.

Fig. 3.3 (a) shows a multifunction dynamic pipeline. There are three stages in this pipeline. There is a feedforward connection from S1 to S3 and two feedback connections, from S3 to S2 and from S3 to S1, in addition to the streamline connections from S1 to S2 and from S2 to S3. The scheduling of successive events into the pipeline is made a nontrivial task by these feedforward and feedback connections. With these connections, the output of the pipeline is not necessarily taken from the last stage. In fact, one can use the same pipeline to evaluate different functions following different dataflow patterns.


(b) Reservation Table for Function X
(c) Reservation Table for Function Y

Fig. 3.3 A Dynamic Pipeline with Feedforward and Feedback Connections for Two Different Functions


Reservation Tables — For a static linear pipeline, the reservation table is trivial because dataflow follows a linear streamline. However, for a dynamic pipeline the reservation table is more interesting since a nonlinear pattern is followed. For a given pipeline configuration, multiple reservation tables can be produced for the evaluation of various functions.

Fig. 3.3 (b) and (c) show two reservation tables corresponding to a function X and a function Y, respectively. One reservation table specifies one function evaluation. A single reservation table is used to specify a static pipeline, while more than one reservation table is used to specify a dynamic pipeline.

For one function evaluation, each reservation table shows the time-space flow of data through the pipeline. Different functions may follow different paths on the reservation table. A number of pipeline configurations may be expressed by the same reservation table. There exists a many-to-many mapping between different pipeline configurations and different reservation tables. The evaluation time of a given function is the number of columns in its reservation table. For example, fig. 3.3 (b) and (c) show that function X needs eight clock cycles to evaluate, and function Y needs six cycles. A pipeline initiation table corresponds to each function evaluation. The same reservation table is used by all initiations to a static pipeline. In contrast, a dynamic pipeline may permit different initiations to follow a mix of reservation tables.

The checkmarks in each row of the reservation table correspond to the time instants (cycles) at which a particular stage will be used. A row may have multiple checkmarks, which means repeated usage of the same stage in different cycles. Contiguous checkmarks in a row specify the extended usage of a stage over more than one cycle. Multiple checkmarks in a column mean that multiple stages are used in parallel during a particular clock cycle.

Q.7. Differentiate between linear and non-linear processors. (R.G.P.V., Dec. 2006)

Or
Differentiate between linear pipeline processor and non-linear pipeline processor. (R.G.P.V., June 2015)


Ans. Refer to Q.2 and Q.6.


Q.8. Discuss latency analysis for dynamic pipeline.

Or
With non-linear processors, explain pipelining with latency analysis. Make use of relevant state diagrams whenever required. (R.G.P.V., Dec. 2016)


Ans. The latency is the number of time units (clock cycles) between two initiations of a pipeline. Latency values are positive integers. A latency of k specifies that two initiations are separated by k clock cycles. A collision occurs if two or more initiations attempt to use the same pipeline stage at the same time. A collision specifies resource conflicts between two initiations in the pipeline. Therefore, all collisions must be prevented in scheduling a sequence of pipeline initiations. Some latencies cause collisions, and some do not. Latencies causing collisions are known as forbidden latencies. Latencies 2 and 5 are forbidden latencies in using the pipeline in fig. 3.3 to evaluate the function X, as shown in fig. 3.4.

(a) Collision with Scheduling Latency 2
(b) Collision with Scheduling Latency 5

Fig. 3.4 Collisions with Forbidden Latencies 2 and 5

The i-th initiation is represented as $X_i$ in fig. 3.4. With latency 2, initiations $X_1$ and $X_2$ collide in stage 2 at time 4. These initiations collide in stage 3 at time 7. Likewise, other collisions are depicted at times 5, 6, 8, ..., etc. Fig. 3.4 (b) shows the collision patterns for latency 5, where $X_1$ and $X_2$ are scheduled 5 clock cycles apart. Their first collision takes place at time 6.

To detect a forbidden latency, one needs to check the distance between any two checkmarks in the same row of the reservation table. For instance, the distance between the first mark and the second mark in row S1 is 5 in fig. 3.3 (b), specifying that 5 is a forbidden latency. Likewise, latencies 2, 4, 5 and 7 are all found to be forbidden by inspecting the same reservation table. We find the forbidden latencies 2 and 4 for function Y from the reservation table in fig. 3.3 (c). A sequence of permissible nonforbidden latencies between successive task initiations is a latency sequence.

(a) Latency Cycle (1, 8) = 1, 8, 1, 8, 1, 8, ... with an Average Latency of 4.5
(b) Latency Cycle (3) = 3, 3, 3, 3, ... with an Average Latency of 3
(c) Latency Cycle (6) = 6, 6, 6, 6, ... with an Average Latency of 6

Fig. 3.5 Three Valid Latency Cycles for the Evaluation of Function X


A latency sequence that repeats the same subsequence (cycle) indefinitely is known as a latency cycle. The latency cycles in using the pipeline in fig. 3.3 to calculate the function X without causing a collision are shown in fig. 3.5. For instance, the latency cycle (1, 8) denotes the infinite latency sequence 1, 8, 1, 8, 1, 8, .... This specifies that successive initiations of new tasks are separated by one cycle and eight cycles alternately.

The average latency of a latency cycle is determined by dividing the sum of all latencies by the number of latencies along the cycle. Therefore, the latency cycle (1, 8) has an average latency of (1 + 8)/2 = 4.5. A latency cycle having only one latency value is known as a constant cycle. In fig. 3.5 (b) and (c), cycles (3) and (6) are both constant cycles. The average latency of a constant cycle is the latency itself.


Q.9. What is forbidden latency ? (R.G.P.V., June 2015)

Ans. Refer to Q.8.


Q.10. Discuss in brief pipeline throughput and pipeline efficiency of non-linear pipelining.


Ans. Pipeline Throughput — The initiation rate, or the average number of task initiations per clock cycle, is the pipeline throughput. When N tasks are initiated within n pipeline cycles, the initiation rate or pipeline throughput is measured as N/n. This rate is determined primarily by the inverse of the MAL adopted. Hence, the scheduling strategy affects the pipeline performance.

Generally, the smaller the adopted MAL, the greater the throughput that can be expected. The maximum achievable throughput is one task initiation per cycle, when the MAL is 1, because 1 ≤ MAL ≤ the shortest latency of any greedy cycle. The pipeline throughput becomes a fraction unless the MAL is reduced to 1.



Pipeline Efficiency — The stage utilization is the percentage of time that each pipeline stage is used over a sufficiently long series of task initiations. The pipeline efficiency is determined by the accumulated rate of all stage utilizations. There exists a relationship between the pipeline throughput and pipeline efficiency. A shorter latency cycle results in higher throughput. Higher efficiency specifies less idle time for pipeline stages. The relationship between the two measures is a function of the reservation table and of the initiation cycle adopted.

At the steady state in any acceptable initiation cycle, at least one stage of the pipeline should be fully (100%) utilized. Otherwise, the pipeline capability has not been fully explored. In such situations, the initiation cycle may not be optimal and another initiation cycle should be examined for enhancement.
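Both measures can be estimated from a reservation table and a collision-free latency cycle, as in the Python sketch below. Two assumptions are made: `TABLE_X` is an assumed table for function X (the figure is not available in this copy), and the efficiency is taken as the mean of the stage utilizations, which is one reading of "accumulated rate of all stage utilizations".

```python
from fractions import Fraction

# ASSUMED reservation table for function X (stage -> checkmarked cycles);
# consistent with the forbidden latencies in the text, not taken from the figure.
TABLE_X = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}


def average_latency(latency_cycle):
    return Fraction(sum(latency_cycle), len(latency_cycle))


def throughput(latency_cycle, clock_period=1):
    """Initiations per unit time: 1 / (average latency * clock period)."""
    return 1 / (average_latency(latency_cycle) * clock_period)


def stage_utilization(table, latency_cycle):
    """Steady-state fraction of cycles each stage is busy: one initiation
    arrives every `average latency` cycles and marks the stage len(cycles) times."""
    avg = average_latency(latency_cycle)
    return {stage: Fraction(len(cycles)) / avg for stage, cycles in table.items()}


def efficiency(table, latency_cycle):
    """Mean stage utilization (assumed definition of pipeline efficiency)."""
    utilization = stage_utilization(table, latency_cycle)
    return sum(utilization.values()) / len(utilization)


if __name__ == "__main__":
    for cycle in [(3,), (1, 8)]:
        print(cycle, throughput(cycle), efficiency(TABLE_X, cycle))
```

For the constant cycle (3), stages S1 and S3 come out fully utilized, which matches the requirement that at least one stage be 100% busy in an acceptable initiation cycle; the cycle (1, 8) leaves every stage partly idle.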
Q.11. Define latency and throughput of pipeline. (R.G.P.V., June 2016)

Ans. Latency — Refer to Q.24 (Unit-I).

Pipeline Throughput — Refer to Q.10.


Q.12. Explain the following concepts for collision-free scheduling —
(i) Collision vectors (ii) State diagrams (iii) Greedy cycles.

Ans. When scheduling events in a pipeline, the primary objective is to obtain the shortest average latency between initiations without causing collisions.

(i) Collision Vectors — We can differentiate the set of permissible latencies from the set of forbidden latencies by examining the reservation table. The maximum forbidden latency is m ≤ n − 1 for a reservation table with n columns. The permissible latency p should be as small as possible. The selection is made in the range 1 ≤ p ≤ m − 1. In the ideal case, the value of the permissible latency is 1. Theoretically, a latency of 1 can always be obtained in a static pipeline which follows a linear reservation table as depicted in fig. 3.1 (c). A collision vector can show the combined set of both permissible and forbidden latencies. It is an m-bit binary vector $C = (C_m C_{m-1} \ldots C_2 C_1)$. The value of $C_i = 1$ if latency i causes a collision and $C_i = 0$ if latency i is permissible. The value of $C_m = 1$, corresponding to the maximum forbidden latency.

A collision vector $C_X$ = (1011010) is obtained for function X, and $C_Y$ = (1010) for function Y, for the reservation tables in fig. 3.3. It is clear from $C_X$ that latencies 7, 5, 4 and 2 are forbidden and latencies 6, 3 and 1 are permissible. Likewise, latencies 4 and 2 are forbidden and latencies 3 and 1 are permissible for function Y.
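The encoding of a forbidden set as a collision vector is mechanical, as the following Python sketch shows. The function names are illustrative; the vector is written as a string with $C_m$ leftmost and $C_1$ rightmost, matching the notation above.

```python
def collision_vector(forbidden):
    """Build the m-bit string C_m ... C_1 with C_i = 1 for forbidden latency i,
    where m is the largest forbidden latency."""
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


def permissible_latencies(vector):
    """Latencies 1..m whose bit is 0 (C_i sits at string index m - i)."""
    m = len(vector)
    return [i for i in range(1, m + 1) if vector[m - i] == "0"]


if __name__ == "__main__":
    c_x = collision_vector({2, 4, 5, 7})   # forbidden latencies of function X
    c_y = collision_vector({2, 4})         # forbidden latencies of function Y
    print(c_x, permissible_latencies(c_x))
    print(c_y, permissible_latencies(c_y))
```

Latencies larger than m never appear in the vector because they are always permissible.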
(ii) State Diagrams — By using the collision vector, we can construct a state diagram specifying the permissible state transitions among successive initiations. The collision vector corresponds to the initial state of the pipeline at time 1 and is therefore known as an initial collision vector. Let p be a permissible latency within the range 1 ≤ p ≤ m − 1. The next state of the pipeline at time t + p is obtained with the help of an m-bit right shift register as depicted in fig. 3.6 (a). Initially, the initial collision vector C is loaded into the register. Then the register is shifted to the right. Each 1-bit shift results in an increase in the latency by 1. When a 0 bit emerges from the right end after p shifts, p is a permissible latency. Likewise, a 1 bit being shifted out means a collision, and thus the corresponding latency should be forbidden.

Logical 0s are fed in from the left end of the shift register. Thus, the next state after p shifts is obtained by bitwise-ORing the initial collision vector with the shifted register contents. For instance, from the initial state $C_X$ = (1011010), the next state (1111111) is reached after one right shift of the register, and the next state (1011011) is reached after three shifts or six shifts.

In the present state of a state diagram, the permissible and forbidden latencies are represented by 0's and 1's, say at time t. The bitwise ORing of the shifted version of the present state with the initial collision vector is done to prevent collisions from future initiations beginning at time t + 1 and onward.

Therefore, the state diagram covers all permissible state transitions which prevent collisions. All latencies greater than m are permissible. This specifies that collisions can always be prevented when events are scheduled far enough apart (with latencies of m + 1 or more). From the pipeline throughput viewpoint, such long latencies are intolerable.

A state diagram is obtained in fig. 3.6 (b) for function X. Only three outgoing transitions are possible from the initial state (1011010), corresponding to the three permissible latencies 6, 3 and 1 in the initial collision vector. Likewise, from state (1011011), one reaches the same state after either three shifts or six shifts. If the number of shifts is m + 1 or greater, all transitions are redirected back to the initial state. For example, after eight or more (denoted as 8+) shifts, the next state must be the initial state, regardless of which state the transition begins from. Fig. 3.6 (c) shows a state diagram for the reservation table in fig. 3.3 (c), obtained with the help of a 4-bit shift register.

When the initial collision vector is determined, the corresponding state diagram is uniquely determined. Various reservation tables may result in the same or distinct initial collision vector(s). This specifies that even distinct reservation tables may generate the same state diagram. However, distinct reservation tables may also generate different collision vectors and hence distinct state diagrams.

(a) State Transition Using an m-bit Right Shift Register ("0" safe, "1" collision)
(b) State Diagram for Function X
(c) State Diagram for Function Y

Fig. 3.6
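The shift-and-OR rule can be turned into a small state-diagram generator. The Python sketch below stores each state as an integer whose lowest bit is $C_1$, so a right shift of the integer is exactly the right shift of the register; the single edge labelled m + 1 stands for all latencies of m + 1 or more. The function name is illustrative.

```python
def state_diagram(initial_vector):
    """Return {state: {latency: next_state}} with states as bit strings
    C_m ... C_1. The edge labelled m + 1 represents every latency >= m + 1."""
    m = len(initial_vector)
    initial = int(initial_vector, 2)   # bit (i - 1) of the integer holds C_i

    def as_bits(state):
        return format(state, f"0{m}b")

    diagram, pending = {}, [initial]
    while pending:
        state = pending.pop()
        if as_bits(state) in diagram:
            continue
        edges = {}
        for p in range(1, m + 1):
            if not (state >> (p - 1)) & 1:         # C_p = 0: latency p is permissible
                edges[p] = (state >> p) | initial   # shift right p times, OR with C
        edges[m + 1] = initial                      # long latencies reset the pipeline
        diagram[as_bits(state)] = {p: as_bits(nxt) for p, nxt in edges.items()}
        pending.extend(edges.values())
    return diagram


if __name__ == "__main__":
    for state, edges in state_diagram("1011010").items():
        print(state, edges)
```

Starting from (1011010), the generator visits only the three states discussed above, and starting from (1010) it visits three states for function Y.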
(iii) Greedy Cycles — Using the state diagram, we can find optimal latency cycles which result in the MAL. Infinitely many latency cycles can be traced from the state diagram. From the state diagram shown in fig. 3.6 (b), (1, 8), (1, 8, 6, 8), (3), (6), (3, 8), (3, 6, 3), ..., are legitimate cycles. Among these cycles, only simple cycles are of interest.

A latency cycle in which each state appears only once is called a simple cycle. Only (3), (6), (8), (1, 8), (3, 8) and (6, 8) are simple cycles in the state diagram in fig. 3.6 (b). The cycle (1, 8, 6, 8) is not a simple cycle since it passes through the state (1011010) two times. Likewise, the cycle (3, 6, 3, 8, 6) is not a simple cycle as it goes through the state (1011011) thrice.

Some of the simple cycles are greedy cycles. In a greedy cycle, all edges are made with minimum latencies from their respective starting states. For instance, the cycles (1, 8) and (3) are greedy cycles in fig. 3.6 (b), and (1, 5) and (3) in fig. 3.6 (c). These cycles must first be simple, and their average latencies must be lower than those of the other simple cycles. The greedy cycle (1, 8) has an average latency of (1 + 8)/2 = 4.5, which is lower than that of the simple cycle (6, 8), namely (6 + 8)/2 = 7. The greedy cycle (3) has a constant latency which equals the MAL for evaluating function X without causing a collision. In fig. 3.6 (c), the MAL is 3, corresponding to either of the two greedy cycles. In the state diagrams, the minimum-latency edges are denoted with asterisks.

At least one of the greedy cycles will result in the MAL. Hence, the collision-free scheduling of pipeline events reduces to finding greedy cycles in the set of simple cycles. The final choice is the greedy cycle providing the MAL.
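The search for simple and greedy cycles can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a prescribed algorithm: states are integers with $C_1$ in the lowest bit, the edge m + 1 stands for all longer latencies, and each simple cycle is reported once, starting from its numerically smallest state. The MAL is taken here as the smallest average latency over all simple cycles.

```python
def state_diagram(initial_vector):
    """Return {state: {latency: next_state}} with states held as integers."""
    m = len(initial_vector)
    initial = int(initial_vector, 2)
    diagram, pending = {}, [initial]
    while pending:
        state = pending.pop()
        if state in diagram:
            continue
        edges = {p: (state >> p) | initial
                 for p in range(1, m + 1) if not (state >> (p - 1)) & 1}
        edges[m + 1] = initial              # represents all latencies >= m + 1
        diagram[state] = edges
        pending.extend(edges.values())
    return diagram


def simple_cycles(diagram):
    """List (start_state, latencies) for every cycle visiting each state once."""
    found = []

    def walk(start, state, latencies, visited):
        for latency, nxt in sorted(diagram[state].items()):
            if nxt == start:
                found.append((start, tuple(latencies + [latency])))
            elif nxt > start and nxt not in visited:
                # Only move to states above the start, so no cycle is listed twice.
                walk(start, nxt, latencies + [latency], visited | {nxt})

    for start in sorted(diagram):
        walk(start, start, [], {start})
    return found


def is_greedy(diagram, start, latencies):
    """True if every edge is the minimum-latency edge out of its state."""
    state = start
    for latency in latencies:
        if latency != min(diagram[state]):
            return False
        state = diagram[state][latency]
    return True


def analyse(initial_vector):
    """Return (simple cycles, greedy cycles, minimal average latency)."""
    diagram = state_diagram(initial_vector)
    cycles = simple_cycles(diagram)
    simple = {lat for _, lat in cycles}
    greedy = {lat for start, lat in cycles if is_greedy(diagram, start, lat)}
    mal = min(sum(lat) / len(lat) for lat in simple)
    return simple, greedy, mal


if __name__ == "__main__":
    for vector in ("1011010", "1010"):
        simple, greedy, mal = analyse(vector)
        print(vector, sorted(simple), sorted(greedy), mal)
```

For the collision vector (1011010) the sketch reproduces the six simple cycles and the two greedy cycles listed above; taking the minimum over all simple cycles rather than only the greedy ones avoids relying on the greedy property when computing the MAL.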


Q.13. Discuss the technique of pipeline schedule optimization.

Ans. An optimization scheme based on the MAL is given below. The idea is to insert noncompute delay stages into the original pipeline. This will provide a modified reservation table, resulting in a new collision vector and an enhanced state diagram. The aim is to get an optimal latency cycle, which is absolutely the shortest.

In 1972, Shar identified the following bounds on the MAL (Minimal Average Latency) achievable by any control strategy on a statically reconfigured pipeline executing a given reservation table —

(i) The MAL is less than or equal to the average latency of any greedy cycle in the state diagram.

(ii) The MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table.

(iii) The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial collision vector plus 1. This is also an upper bound on the MAL.

These results imply that the optimal latency cycle must be chosen from one of the lowest greedy cycles. However, a greedy cycle alone is not enough to guarantee the optimality of the MAL; the lower bound guarantees it. For example, the MAL = 3 for both function X and function Y has met the lower bound of 3 from their respective reservation tables.


In fig. 3.6 (b), the upper bound on the MAL for function X is equal to 
4+ 1 = 5, a rather loose bound. In contrast, fig. 3.6 (c) depicts a rather tight 
upper bound of 2 + 1 =3 on the MAL. Therefore, all greedy cycles for function 
Y provide the optimal latency value of 3, which cannot be further reduced. 
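The two bounds can be read directly off a reservation table, as the Python sketch below illustrates. `TABLE_X` and `TABLE_Y` are assumed tables that agree with the collision vectors quoted in the text, since the original figure is not available in this copy; the function name `mal_bounds` is illustrative.

```python
# ASSUMED reservation tables (stage -> checkmarked cycles), consistent with
# C_X = (1011010) and C_Y = (1010); not copied from the original figure.
TABLE_X = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}
TABLE_Y = {"S1": [1, 5], "S2": [3], "S3": [2, 4, 6]}


def mal_bounds(table):
    """Return (lower, upper) bounds on the MAL:
    lower = maximum number of checkmarks in any row,
    upper = number of 1's in the initial collision vector + 1."""
    lower = max(len(cycles) for cycles in table.values())
    forbidden = {later - earlier
                 for cycles in table.values()
                 for earlier in cycles for later in cycles if later > earlier}
    upper = len(forbidden) + 1   # each distinct forbidden latency is one 1-bit
    return lower, upper


if __name__ == "__main__":
    print("X:", mal_bounds(TABLE_X))
    print("Y:", mal_bounds(TABLE_Y))
```

When the two bounds coincide, as for function Y, any greedy cycle is already optimal; when they differ, as for function X, the greedy cycles still have to be compared against the lower bound.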

To optimize the MAL, one needs to reach the lower bound by modifying the reservation table. The method is to decrease the maximum number of checkmarks in any row. The modified reservation table must preserve the original function being evaluated. Patel and Davidson have recommended the use of noncompute delay stages to enhance pipeline performance with a shorter MAL. Their technique is given below —

Delay Insertion — The aim of delay insertion is to modify the reservation table, obtaining a new collision vector. This results in a modified state diagram, which may generate greedy cycles satisfying the lower bound on the MAL.

Fig. 3.7 (a) shows a three-stage pipeline, specified by the reservation table in fig. 3.7 (b), before delay insertion. This table results in a collision vector C = (1011), corresponding to forbidden latencies 1, 2 and 4. The corresponding state diagram in fig. 3.7 (c) has only one self-reflecting state with a greedy cycle of latency 3, equal to the MAL. The maximum number of checkmarks in any row of the given reservation table is 2. Therefore, the MAL = 3 achieved in fig. 3.7 (c) is not optimal.


(b) Reservation Table and Operations being Delayed
(c) State Diagram with MAL = 3

Fig. 3.7 A Pipeline with a Minimum Average Latency of 3

The insertion of a noncompute stage D1 after stage S3 will delay both the X1 and X2 operations one cycle beyond time 4. The insertion of another noncompute stage D2 after the second usage of S1 will delay the operation X2 by another cycle. These delayed operations, as grouped in fig. 3.7 (b), provide a new pipeline configuration in fig. 3.8 (a). Both delay elements D1 and D2 are inserted as extra stages, as depicted in fig. 3.8 (b), with an enlarged reservation table containing 3 + 2 = 5 rows and 5 + 2 = 7 columns.

(c) Modified State Diagram with a Reduced MAL = 2

Fig. 3.8 Insertion of Two Delay Stages to Obtain an Optimal MAL for the Pipeline in Fig. 3.7

NUMERICAL PROBLEMS


Prob.1. Consider the execution of a program of 15,000 instructions by a linear pipeline processor with a clock rate of 25 MHz. Assume that the instruction pipeline has five stages.
(i) Calculate the speedup factor in using this pipeline to execute the program as compared with the use of an equivalent non-pipelined processor with an equal amount of flow through delay.
(ii) What are the efficiency and throughput of this pipelined processor ? (R.G.P.V., June 2013)

Sol. (i) Speedup factor,

$$S_k = \frac{T_1}{T_k} = \frac{n k \tau}{k\tau + (n - 1)\tau} = \frac{n k}{k + (n - 1)} = \frac{15{,}000 \times 5}{5 + (15{,}000 - 1)} = \frac{75{,}000}{15{,}004} \approx 4.9987$$ Ans.

(ii) Efficiency,

$$E_k = \frac{S_k}{k} = \frac{15{,}000}{15{,}004} \approx 0.9997$$ Ans.

Throughput,

$$H_k = \frac{n f}{k + (n - 1)} = \frac{15{,}000 \times 25}{5 + (15{,}000 - 1)} = \frac{375{,}000}{15{,}004} \approx 24.99 \text{ MIPS}$$ Ans.

Prob.2. The time delays of the four segments in the pipeline processor are as follows —
$t_1$ = 50 ns, $t_2$ = 35 ns, $t_3$ = 95 ns and $t_4$ = 45 ns. The interface register delay time $t_r$ = 5 ns.
(i) How long would it take to add 100 pairs of numbers in the pipeline ?
(ii) How can we reduce the total time to about one-half of the time calculated in part (i) ? (R.G.P.V., Dec. 2009)

Sol. (i) We know that,

Clock period, $T = \max(t_1, t_2, t_3, t_4) + t_r$
T = max(50, 35, 95, 45) + 5
T = 95 + 5 = 100 ns (set by segment 3)

Time to add 100 pairs of numbers,

$T_k = \{k + (n - 1)\}T$
= {4 + (100 − 1)} × 100
= (103) × 100 = 10300 ns = 10.3 µs Ans.

(ii) Now, divide segment 3 into two segments of 50 ns and 45 ns. With the register delay, these take 50 + 5 = 55 ns and 45 + 5 = 50 ns. Thus T = 55 ns and k = 5.

{k + (n − 1)}T = {5 + (100 − 1)} × 55
= (104) × 55 = 5720 ns = 5.72 µs Ans.
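The arithmetic of Prob.2 can be checked with a few lines of Python. This is only a verification sketch; the function name `pipeline_time` is illustrative, and all delays are in nanoseconds.

```python
def pipeline_time(segment_delays, register_delay, n):
    """Total time {k + (n - 1)} * T for n operand pairs, where k is the number
    of segments and T = max segment delay + register delay."""
    k = len(segment_delays)
    period = max(segment_delays) + register_delay
    return (k + (n - 1)) * period


if __name__ == "__main__":
    original = pipeline_time([50, 35, 95, 45], 5, 100)
    # Segment 3 (95 ns) split into two segments of 50 ns and 45 ns.
    split = pipeline_time([50, 35, 50, 45, 45], 5, 100)
    print(original, split, split / original)
```

Splitting the slowest segment adds one stage but lowers the clock period from 100 ns to 55 ns, so the total time drops to a little over half of the original.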
Prob.3. Consider the five stage pipelined processor specified by the following reservation table —

(i) List the set of forbidden latencies and collision vector.
(ii) Draw a state transition diagram showing all possible initial sequences without causing a collision in the pipeline.
(iii) List all the simple cycles.
(iv) Identify the greedy cycles.
(v) What is the MAL of this pipeline ? (R.G.P.V., May 2018)

Sol. (i) Forbidden latencies are

(ii) The state transition diagram of this pipeline is drawn in fig. 3.9.

Fig. 3.9

(iii) The simple cycles are listed below —
(1, 6), (3), (4, 6)
(iv) The greedy cycles are (1, 6) and (3).
(v) MAL is 3.


Prob.4. Consider the following pipeline reservation table:

[Reservation table with time columns 1 to 4]

(i) What are the forbidden latencies ?
(ii) Draw the state transition diagram.
(iii) List all simple cycles and greedy cycles.
(iv) Determine the optimal constant latency cycle and the minimal
average latency.
(v) Let the pipeline clock period be T = 20 ns. Determine the
throughput of this pipeline.
(R.G.P.V., Dec. 2003, June 2010, Dec. 2014)

Sol. (i) Forbidden latencies are calculated by checking the distance
between any two checkmarks in the same row of the reservation table. Here,
the forbidden latency is 3 with a collision vector (100).

(ii) State transition diagram is shown in fig. 3.10.

Fig. 3.10

(iii) The simple cycles are listed below:
(2), (4), (1, 4), (1, 1, 4) and (2, 4).
The greedy cycles are (1, 1, 4) and (2).

(iv) Optimal constant latency cycle is (2) and the MAL = 2.
MAL is calculated as below:

Simple Cycle        Average Latency
(2)                 2
(4)                 4
(1, 4)              (1 + 4)/2 = 2.5
(1, 1, 4)           (1 + 1 + 4)/3 = 2
(2, 4)              (2 + 4)/2 = 3

(v) The pipeline clock period, T = 20 ns
Here, MAL = 2
Thus, the pipeline produces each output in 2 x 20 ns = 40 ns.
Throughput is the output of the pipeline in 1 sec, i.e.,
1/(40 x 10^-9) = (1 x 10^9)/40 = 2.5 x 10^7 = 25 MIPS Ans.

Prob.5. A certain dynamic pipeline with the four segments S1, S2, S3
and S4 is characterized by the following reservation table:

[Reservation table]

(i) Determine latencies in the forbidden list F and the collision
vector C.
(ii) Determine the minimum constant latency L by checking the
forbidden list.
(iii) Draw the state diagram for this pipeline. Determine the
minimal average latency (MAL) and the maximum throughput of this
pipeline.
(R.G.P.V., Dec. 2004, June 2008)

Sol. (i) The forbidden latencies are 5, 4, 2.
The collision vector is (11010).
(ii) The minimum constant latency cycle is 3 here, as it is the
minimum latency cycle that keeps the pipeline state constant as before.
(iii) The simple cycles are
(6), (1, 6), (3, 6), (3)
The minimum average latency (MAL) is 3.
Maximum throughput of the pipeline is 1/(3τ), assuming the clock period
of the pipeline T = τ ns and the given pipeline has the evaluation period
of six cycles.

Prob.6. Consider the following pipeline processor with four stages. All
successor stages after each stage must be used in successive clock periods.

Fig. 3.12

Answer the following questions associated with using this pipeline with
an evaluation time of eight pipeline clock periods:
(i) Write down the reservation tables.
(ii) Explain collision vectors and forbidden latencies.
(iii) Explain the state diagram.
(iv) Explain the throughput of this pipeline.
(R.G.P.V., June 2003)

Sol. Here, the evaluation time of the pipeline is given as eight clock periods.
Thus, we have to construct a reservation table with eight columns.
(i) One possible reservation table of the above pipeline is drawn below:

[Reservation table: stages S1 to S4 against time 1 to 8]

(ii) Collision vector of the above pipeline is (101010), i.e., the
forbidden latencies for the pipeline are 2, 4 and 6.
(iii) The state diagram of this pipeline is drawn in fig. 3.13.
The simple cycles are
(7), (1, 7)*, (3, 7), (3, 5, 7), (5, 3)*, (5)
The greedy cycles are
(1, 7) and (5, 3).
(iv) The maximum throughput of a linear pipeline is equal to the clock
frequency, which corresponds to one output result per clock period. Since
the evaluation time of the given pipeline is eight clock periods, here we are
going to get one result per eight clock periods. If we assume the clock period
T = τ ns, then the maximum throughput of the given pipeline is 1/(8τ).

Prob.7. Consider the five stage pipelined processor specified by the
following reservation table:
(i) List the set of forbidden latencies and collision vector.
(ii) Draw a state transition diagram showing all possible initial
sequences without causing collision in the pipeline.
(iii) List all the simple cycles.
(iv) Identify the greedy cycles.
(v) What is MAL ?
(vi) What will be the maximum throughput ?
(R.G.P.V., Dec. 2010)
Or
Consider the five stage pipelined processor specified by the following
reservation table:
(i) List the set of forbidden latencies and collision vector.
(ii) What is the minimum average latency (MAL) of this pipeline ?
(iii) Draw a state transition diagram.
(R.G.P.V., June 2014)

Sol. (i) The forbidden latencies are 3, 4 and 5. The collision vector is 11100.
(ii) The state transition diagram of this pipeline
is drawn in fig. 3.14.
(iii) The simple cycles are (2), ©), (1, 6), 
ay 2) and (1, 6) 
(iv) The greedy cycles are (2) and (1, 6).
(v) The minimum average latency (MAL) is 2.

Fig. 3.14

(vi) The maximum throughput is 1/MAL
= 1/2 = 0.5 or 50%


Prob.8. Consider the following reservation table for 4 stage pipeline
with clock cycle P = 20 ns:
(i) What are the forbidden latencies and initial collision vector ?
(ii) Draw state transition diagram.
(iii) Determine the MAL associated with the shortest greedy cycle.
(iv) Determine the pipeline throughput corresponding to the MAL
and given P.
(R.G.P.V., June 2011)
Or
Consider the following reservation table for a four stage pipeline with a
clock cycle T = 20 ns:
(i) What are the forbidden latencies and initial collision vector ?
(ii) Draw state transition diagram.
(iii) Determine greedy cycle and simple cycle.
(iv) Determine MAL.
(R.G.P.V., June 2013)

Sol. (i) The forbidden latencies are 1, 2 and 5. The
initial collision vector is 10011.
(ii) The state transition diagram is drawn in fig. 3.15.

Fig. 3.15

(iii) The greedy cycle is 3 and so is the MAL.
(iv) The pipeline throughput corresponding to the MAL is 1/3
= 0.33
The pipeline throughput corresponding to the P
= 0.5

Prob.9. Find the following for the given reservation table:
(i) Forbidden latency (ii) Greedy cycle
(iii) State transition diagram (iv) MAL.

Sol. (i) The forbidden latencies are 2, 4, 5 and 7.
(ii) The greedy cycles are (3) and (1, 8).
(iii) The state transition diagram is shown in fig. 3.16.
(iv) The minimal average latency (MAL) is 3.


INSTRUCTION PIPELINE DESIGN, MECHANISMS FOR
INSTRUCTION PIPELINING, PIPELINE HAZARDS, DYNAMIC
INSTRUCTION SCHEDULING - SCORE BOARDING AND
TOMASULO'S ALGORITHM, BRANCH HANDLING TECHNIQUES


Q.14. How the pipeline processing is done in an instruction pipeline ?
Explain with timing diagram for four segment instruction pipeline.
Or
Explain the structures and operational requirements of the instruction
pipelines used in CISC, scalar RISC and superscalar RISC processors.
(R.G.P.V., Dec. 2010)

Ans. The consecutive instructions are read from memory by an instruction
pipeline while previous instructions are being executed in other segments.
The instruction fetch and execute phases overlap here and perform simultaneous
operations. Such a scheme has one possible digression: an instruction may
result in a branch out of sequence. In such a situation, the pipeline must be
emptied and all the instructions which have been read from memory after the
branch instruction must be eliminated.

Consider a computer with an instruction fetch unit and an instruction
execution unit designed to give a two-segment pipeline. The instruction fetch
segment can be implemented by means of a first-in, first-out (FIFO) buffer.
This is a type of unit which makes a queue instead of a stack. The control
increments the program counter and uses its address value to read consecutive
instructions from memory, whenever the execution unit is not using memory.
The instructions are fed into the FIFO buffer to be run in a first-in, first-out
manner. Hence, an instruction stream is placed in a queue, waiting for decoding
and processing by the execution segment. The instruction stream queuing
mechanism offers an efficient method for decreasing the average access time
to memory for reading instructions. The control unit begins the next instruction
fetch phase whenever there is space in the FIFO buffer. The buffer acts like a
queue from which control obtains the instructions for the execution unit.

Apart from fetch and execute phases, computers with complex instructions
require other phases to process an instruction completely. Generally, the computer
needs to process each instruction with the following sequence of steps:
(i) Fetch the instruction from memory
(ii) Decode the instruction
(iii) Compute the effective address
(iv) Fetch the operands from memory
(v) Execute the instruction
(vi) Store the result in the suitable place.

The instruction pipeline does not work at its maximum rate because of
some problems. Different segments may consume different times to operate on
the incoming information. Some segments are skipped for specific operations.
As an example, a register mode instruction does not need an effective address
calculation. Two or more segments may require memory access at the same
time, causing one segment to wait till another has finished with the memory.
Memory access conflicts are sometimes resolved by using two memory buses
for accessing instructions and data in separate modules. In this way, an instruction
word and a data word can be read simultaneously from two different modules.
If the instruction cycle is split into segments of equal duration, then the design
of an instruction pipeline will be most efficient. The time that each step consumes
to complete its function relies on the instruction and the way it is executed.

Four Segment Instruction Pipeline: Assume that the decoding of the
instruction can be combined with the computation of the effective address
into one segment. Assume further that most of the instructions place the
result into a processor register so that the instruction execution and
storing of the result can be combined into one segment. This decreases the
instruction pipeline into four segments.

Fig. 3.17 shows the processing of the instruction cycle in the CPU with a
four-segment pipeline. The next instruction in sequence is busy fetching an
operand from memory in segment 3 at the time of executing an instruction in
segment 4. The effective address can be computed for the third instruction in
a separate arithmetic circuit and whenever the memory is available, the fourth
and all subsequent instructions can be fetched and placed in an instruction
FIFO. Therefore, upto four different instructions can be in progress of being
processed at the same time and upto four suboperations in the instruction
cycle can overlap.

Fig. 3.17 Four-segment CPU Pipeline

Once in a while, an instruction in the sequence may be a program control
type which causes a branch out of normal sequence. In such a case, the pending
operations in the last two segments are completed and all information stored in
the instruction buffer is deleted. Then, the pipeline restarts from the new address
stored in the program counter. Similarly, when acknowledged, an interrupt request
will cause the pipeline to empty and begin again from a new address value.

Fig. 3.18 shows the operation of the instruction pipeline. The time in the
horizontal axis is split into steps of equal duration. Four segments are represented
in the diagram with an abbreviated symbol.
(i) Segment FI fetches an instruction.
(ii) Segment DA decodes the instruction and computes the effective
address.
(iii) Segment FO fetches the operand.
(iv) Segment EX executes the instruction.

Now consider that the processor has separate instruction and data
memories so that the operation in FI and FO can be carried out at the same
time. If a branch instruction is not present, each segment works on different
instructions. Therefore, in step 4, instruction 1 is being executed in
segment EX; the operand for instruction 2 is being fetched in segment FO;
instruction 3 is being decoded in segment DA; and instruction 4 is being
fetched from memory in segment FI.

Fig. 3.18 Timing of Instruction Pipeline

Now, suppose that instruction 3 is a branch instruction. As soon as this
instruction is decoded in segment DA in step 4, the transfer from FI to DA of the
other instructions is stopped till the branch instruction is executed in step 6.
In case the branch is taken, a new instruction is fetched in step 7. When the
branch is not taken, the instruction fetched previously in step 4 can be used.
Then, the pipeline continues till a new branch instruction is encountered.

In the pipeline, another delay may take place if the EX segment requires
to store the result of the operation in the data memory while the FO segment
needs to fetch an operand. In such a case, segment FO must wait till segment
EX has finished its operation.

Q.15. Discuss the different types of prefetch buffers used in instruction
pipelining.

Ans. There are three types of buffers, which are utilized to match the
instruction fetch rate to the pipeline consumption rate. Fig. 3.19 shows that
a block of consecutive instructions are fetched into a prefetch buffer in one
memory access time. The block access can be obtained employing interleaved
memory modules or employing a cache to reduce the effective memory-access
time.

Fig. 3.19 Sequential and Target Buffers

Sequential instructions are loaded into a pair of sequential buffers.
Instructions from a branch target are loaded into a pair of target buffers.
Both buffers work in first-in, first-out manner. These buffers are used as
the part of the pipeline as additional stages. Both sequential buffers and
target buffers are filled with instructions fetched due to a conditional branch
instruction. After the branch condition is checked, the appropriate
instructions are taken from one of the two buffers, and instructions in the
other buffer are eliminated. Within each pair, we can use one buffer to feed
instructions into the pipeline and the other buffer to load instructions from
memory. The two buffers in each pair alternate to avoid a collision between
instructions flowing into and out of the pipeline.

A loop buffer is a third type of prefetch buffer. This buffer keeps sequential
instructions contained in a small loop. The loop buffer works in two steps.
First, it holds instructions sequentially ahead of the current instruction. This
saves the time of instruction fetch from memory. Second, it identifies when the
target of a branch falls within the loop boundary. In this situation, if the
target instruction is already in the loop buffer, unnecessary memory accesses
can be avoided.
Q.16. Why are multiple functional units used in a pipelined processor
design ? How do they alleviate the bottleneck in the execution stages of the
instruction pipeline ?

Ans. Sometimes, there is the bottleneck in a certain pipeline stage. This
stage corresponds to the row having the maximum number of checkmarks in
the reservation table. This bottleneck problem can be resolved by using multiple
copies of the same stage simultaneously. This results in the use of multiple
execution units in a pipelined processor design as shown in fig. 3.20.

Fig. 3.20 A Pipelined Processor with Multiple Functional Units and
Distributed Reservation Stations

Fig. 3.20 shows a model architecture for a pipelined scalar processor
containing multiple functional units. The reservation stations (RS) are used
with each functional unit in order to resolve data or resource dependences
among the successive instructions entering the pipeline. Operands can wait in
the RS till their data dependences have been resolved. Each RS contains a
unique tag, which is monitored by a tag unit. The tags from all currently used
registers or RSs are checked by the tag unit. This register tagging technique
permits the hardware to resolve conflicts between source and destination
registers allocated for multiple instructions. Apart from resolving conflicts,
the RSs also work as buffers to interface the pipelined functional units with
the decode and issue units. Once the dependences are resolved, the multiple
functional units are supposed to work in parallel. This will remove the bottleneck
in the execution stages of the instruction pipeline.

Q.17. Describe the internal data forwarding terminology associated with
pipeline computers. (R.G.P.V., Dec. 2004)
Or
Describe the internal data-forwarding techniques associated with pipeline
computer and its operations. (R.G.P.V., Dec. 2003)
Ans. Internal forwarding is a "short circuit" technique for replacing
unnecessary memory accesses by register-to-register transfers in a sequence
of fetch-arithmetic-store operations. Memory access is much slower than
register-to-register operations. The computer performance can be greatly
improved if we can remove unnecessary memory accesses and combine some
transitive or multiple fetch-store operations with faster register operations.
Fig. 3.21 shows that the concept of internal data forwarding can be explored
in three directions.

The symbols Mi and Rj are used to represent the ith word in the memory
and jth register in the CPU respectively. The arrows (←) are used to specify
data-moving operations like fetch, store and register-to-register transfer. The
contents of Mi and Rj are denoted by (Mi) and (Rj) respectively.

(i) Store-fetch Forwarding: Fig. 3.21 (a) shows that the following
sequence of the two operations store-then-fetch is replaced by two parallel
operations, one store and one register transfer:

Mi ← (R1) (store)
R2 ← (Mi) (fetch)                  [Two memory accesses]
being replaced by
Mi ← (R1) (store)
R2 ← (R1) (register transfer)      [Only one memory access]

(ii) Fetch-fetch Forwarding: Fig. 3.21 (b) shows that the following
two fetch operations are replaced by one fetch and one register transfer.
One memory access has been removed:

R1 ← (Mi) (fetch)
R2 ← (Mi) (fetch)                  [Two memory accesses]
being replaced by
R1 ← (Mi) (fetch)
R2 ← (R1) (register transfer)      [One memory access]

(iii) Store-store Overwriting: The following two memory updates
(stores) of the same word can be combined into one, because the second
store overwrites the first [see fig. 3.21 (c)]:

Mi ← (R1) (store)
Mi ← (R2) (store)                  [Two memory accesses]
being replaced by
Mi ← (R2) (store)                  [One memory access]

Fig. 3.21 Internal Forwarding Examples: (a) Store-fetch Forwarding
(b) Fetch-fetch Forwarding (c) Store-store Overwriting

The following example specifies how to apply internal forwarding to
simplify a sequence of arithmetic and memory-access operations. Fig. 3.22
shows these simplification steps where adjacent steps are combined to minimize
memory references. Nodes in the graph represent the memory cells, registers,
an adder or multiplier.

Example: The inner loop of a specific program is completed to perform
the following operations in a sequence:
(i) R0 ← (M1) (fetch)
(ii) R0 ← (R0) + (M2) (add)
(iii) R0 ← (R0)*(M3) (multiply)
(iv) M4 ← (R0) (store).

Fig. 3.22 Internal Data Forwarding: (a) Original Data Flow Sequence
(b) Step 1 and Step 2 Forwarded (c) Step 2 and Step 3 Forwarded
(d) Step 3 and Step 4 Forwarded

We end up handling a compound function (macro instruction)
M4 ← [(M1) + (M2)]*(M3), as represented by the simplified data-flow graph
in fig. 3.22 (d) after the internal forwarding.
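The three forwarding rules can be viewed as a rewrite of adjacent operations. The Python sketch below is only an illustration of that idea: the tuple format `(kind, destination, source)` and the names `forward` and `memory_accesses` are assumptions made for the example, and only immediately adjacent pairs of operations are examined.

```python
def forward(ops):
    """Apply the three internal-forwarding rules to adjacent operations.

    Each operation is a tuple (kind, destination, source), where kind is
    "fetch" (register <- memory), "store" (memory <- register) or
    "move" (register <- register).
    """
    out = []
    for kind, dst, src in ops:
        if out:
            pkind, pdst, psrc = out[-1]
            if pkind == "store" and kind == "fetch" and src == pdst:
                # Store-fetch: read the register that was just stored.
                out.append(("move", dst, psrc))
                continue
            if pkind == "fetch" and kind == "fetch" and src == psrc and dst != pdst:
                # Fetch-fetch: copy the register that already holds the word.
                out.append(("move", dst, pdst))
                continue
            if pkind == "store" and kind == "store" and dst == pdst:
                # Store-store: the second store overwrites the first.
                out[-1] = (kind, dst, src)
                continue
        out.append((kind, dst, src))
    return out


def memory_accesses(ops):
    """Count the operations that touch memory."""
    return sum(1 for kind, _, _ in ops if kind in ("fetch", "store"))


before = [("store", "Mi", "R1"), ("fetch", "R2", "Mi")]
after = forward(before)
print(after, memory_accesses(before), memory_accesses(after))
# [('store', 'Mi', 'R1'), ('move', 'R2', 'R1')] 2 1
```

In each of the three cases the number of memory accesses falls from two to one, as stated in the rules above.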


Q.18. What do you mean by instruction pipeline conflicts ? Explain in short. 
Or 
What are the major difficulties that cause the instruction pipeline to 
deviate from its normal operation and how they could be overcome ? 


Ans. The instruction pipeline conflicts are the problems that cause the
instruction pipeline to deviate from its normal operation. The major instruction
pipeline conflicts are as follows:

(i) Resource Conflicts: The access to memory by two segments
at the same time causes these conflicts. By using separate instruction and data
memories these conflicts can be resolved.

(ii) Data Dependency Conflicts: These conflicts are caused when
an instruction relies on the result of a previous instruction, but this result is not
yet available.

A difficulty that may result in a degradation of performance in an instruction
pipeline is owing to possible collision of data or address. If an instruction
cannot continue because previous instructions did not complete some specific
operations, then a collision takes place. A data dependency takes place when
an instruction needs data that are not yet available. As an example, an instruction
in the FO segment may need to fetch an operand which is being generated at
the same time by the previous instruction in segment EX. Hence, the second
instruction must wait for data to become available by the first instruction.
Likewise, an address dependency may take place when an operand address
cannot be computed because the information needed by the addressing mode
is not present. As an example, an instruction with register indirect mode cannot
proceed to fetch the operand if the previous instruction is loading the address
into the register. Therefore, the operand access to memory must be delayed
till the needed address is available. Such conflicts between data dependencies
are handled in pipelined computers in several ways.

The easiest technique is to insert hardware interlocks. A circuit that detects
instructions whose source operands are destinations of instructions farther
up in the pipeline, is known as an interlock. Detection of this situation
causes the instruction whose source is not present to be delayed by enough
clock cycles to resolve the conflict. This approach maintains the program
sequence by using hardware to insert the required delays.

Another technique is operand forwarding that uses special hardware to detect
a conflict and then prevent it by routing the data through special paths between
pipeline segments. As an example, rather than transferring an ALU result into a
destination register, the hardware checks the destination operand, and if it is
needed as a source in the next instruction, it passes the result directly into the
ALU input, bypassing the register file. This technique needs additional hardware
paths through multiplexers as well as the circuit that detects the conflict.

In some computers, the responsibility for solving data conflict problems is
given to the compiler which translates the high-level programming language
into a machine language program. The compiler for such computers is designed
to detect a data conflict and reorder the instructions as necessary to delay the
loading of the conflicting data by inserting no-operation instructions. This
technique is known as delayed load.

(iii) Branch Difficulties: These difficulties arise from branch and
other instructions that change the value of PC. While operating an instruction
pipeline, one of the major difficulties is the occurrence of branch instructions. A
branch instruction may be conditional or unconditional. By loading the program
counter with the target address, an unconditional branch always changes the
sequential program flow. In a conditional branch, the control selects the target
instruction if the condition is satisfied or the next sequential instruction if the
condition is not satisfied. The branch instruction breaks the normal sequence of
the instruction stream, causing difficulties in the operation of the instruction
pipeline. In pipelined computers, various hardware techniques are used to reduce
the performance degradation caused by instruction branching.

One approach of handling a conditional branch is to prefetch the target
instruction in addition to the instruction following the branch. Both are saved until
the branch is executed. In case the branch condition is successful, the pipeline
continues from the branch target instruction. An extension of this procedure is to
continue fetching instructions from both places until the branch decision is taken.
At that time, control selects the instruction stream of the correct program flow.

Another approach is the use of a branch target buffer or BTB. The BTB is
an associative memory included in the fetch segment of the pipeline. In the BTB,
each entry has the address of a previously executed branch instruction and the
target instruction for that branch. In addition, it also stores the next few
instructions after the branch target instruction. If the pipeline decodes a branch
instruction, then it searches the associative memory BTB for the address of the
instruction. In case it is in the BTB, the instruction is available directly and
prefetch continues from the new path. When the instruction is not in the BTB,
the pipeline shifts to a new instruction stream and stores the target instruction in
the BTB. This scheme has the advantage that branch instructions which have
occurred previously are readily available in the pipeline without interruption.

The loop buffer, a variation of the BTB, is a small very high speed register
file maintained by the instruction fetch segment of the pipeline. If a program
loop is detected in the program, then it is stored in the loop buffer in its entirety,
including all branches. The program loop can be executed directly without having
to access memory until the loop mode is removed by the final branching out.

In some computers, branch prediction procedure is used. A pipeline with
branch prediction employs some additional logic to guess the outcome of a
conditional branch instruction before it is executed. Then, the pipeline starts
prefetching the instruction stream from the predicted path. A correct prediction
eliminates the wasted time caused by branch penalties.

In most RISC processors, the delayed branch procedure is used. In this
procedure, the compiler detects the branch instructions and rearranges the
machine language code sequence by inserting useful instructions that keep the
pipeline operating without interruptions.
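The delayed load technique described under data dependency conflicts can be sketched as a small compiler pass. The Python below is only an illustration under stated assumptions: the instruction format `(opcode, destination, sources)` is hypothetical, a loaded register is assumed to be unusable by the very next instruction only, and a no-operation is inserted rather than a reordered useful instruction.

```python
NOP = ("nop", None, ())


def insert_delayed_loads(program):
    """Insert a no-operation after any load whose result is needed by the
    instruction that immediately follows it.

    program: list of (opcode, destination, sources) tuples in program order.
    """
    out = []
    for i, (opcode, dest, sources) in enumerate(program):
        out.append((opcode, dest, sources))
        if opcode == "load" and i + 1 < len(program):
            next_sources = program[i + 1][2]
            if dest in next_sources:
                out.append(NOP)   # delay the dependent instruction by one step
    return out


program = [
    ("load", "R1", ("A",)),        # R1 <- M[A]
    ("load", "R2", ("B",)),        # R2 <- M[B]
    ("add", "R3", ("R1", "R2")),   # R3 <- R1 + R2, needs R2 from the load above
    ("store", "C", ("R3",)),       # M[C] <- R3
]
scheduled = insert_delayed_loads(program)
print([opcode for opcode, _, _ in scheduled])
# ['load', 'load', 'nop', 'add', 'store']
```

Only the second load needs a no-operation, because the first load is already separated from the add by one instruction.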


Q.19. Explain the following approaches to the branch problem in
pipeline processor:
(i) Branch elimination (ii) Branch prediction (iii) Branch target.
(R.G.P.V., June 2011)


Ans. Refer to Q.18. 


Q.20. Describe the necessary conditions for the hazards.
(R.G.P.V., June 2003)


Or 


What are the major pipeline hazards ? 
Or 


Write short note on major pipeline hazards. 
(R.GPV., June 2007, Dec. 2009) 


(R.GPRV., June 2004) 


Or 
What are the major hurdles of pipelining ? (R.G.P.V., Dec. 2015)

Ans. Pipeline hazards are due to resource usage conflicts among various
instructions in the pipeline. These hazards are triggered by interinstruction
dependencies. Methods to deal with data-dependent hazards are required in
any type of lookahead processors for either synchronous pipeline or asynchronous
pipeline. Another type of hazard is caused by a job scheduling problem known
as collision. Interinstruction dependencies may arise to prevent the sequential data
flow in the pipeline when successive instructions overlap their fetch, decode
and execution through a pipeline processor. For example, an instruction may
rely on the results of a previous instruction. The present instruction cannot be
initiated into the pipeline until the completion of the previous instruction. In other
instances, two stages of pipeline may require to update the same memory location.
Hazards of this type, if not properly detected and resolved, could result in an
interlock situation in the pipeline or generate unreliable results by overwriting.

According to various data update patterns, there are three classes of
data-dependent hazards: write after read (WAR) hazards, read after write (RAW)
hazards and write after write (WAW) hazards. It should be noted that
read-after-read does not create any problem, because nothing is altered.

Resource objects are used to refer to working registers, memory locations,
and special flags. The contents of these resource objects are known as data
objects. Each instruction is considered a mapping from a set of data objects to
a set of data objects. The domain D(I) of an instruction I is the set of resource
objects whose data objects may influence the execution of instruction I. The
range R(I) of an instruction I is the set of resource objects whose data objects
may be modified by the execution of instruction I. Clearly, the operands to
be used in an instruction execution are taken (read) from its domain and the
results will be stored (written) in its range.

Here, we consider the execution of the two instructions I and J in a program.
Instruction J comes after instruction I in the program. There may be none or
other instructions between instructions I and J. The latency between the two
instructions is a very subtle matter. Before or after the completion of the
execution of instruction I, instruction J may enter the execution pipe. The
imperfect timing and data dependencies may pose some hazardous situations,
as depicted in fig. 3.23.

Fig. 3.23 Possible Hazard Conditions between Read and
Write Operations in an Instruction Pipeline [panel (c): WAR Hazard]

A RAW hazard between the two instructions I and J may happen when J
attempts to read some data object that has been modified by I. A WAR hazard
may happen when J tries to modify some data object that is read by I. A
WAW hazard may happen if both I and J try to modify the same data object.
Formally, the necessary conditions for these hazards are as follows:

R(I) ∩ D(J) ≠ φ, for RAW
R(I) ∩ R(J) ≠ φ, for WAW
D(I) ∩ R(J) ≠ φ, for WAR

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21, ; 
oe pul 4 E to the mechanism for instruction pip elining explai 
a forwarding and hazards between read and write Aria es 
n. 


6: (R.GPV., May 2018) 


Ex ° . a 
a a internal data forwarding and possible hazard between read 
tons in the context of instruction pipelining. (R.GPV, D 
bi. Ans. Refer to Q.17 and Q.20. gy 


R.GPYV. 
Ans. Refer to Q.20 and Q.17. (R.GPV., June 2016) 


Q.23. What is hazard ? Explain its types with suitable example.
(R.G.P.V., Dec. 2017)

Ans. Refer to Q.18 and Q.20.
Q.24. Explain the hazard detection and resolution. (R.GPV., June 2006) 


Or 
Write an essay on hazard detection and resolution. (R.G.P.V., Dec. 2005)
Or


Write short note on hazard detection and resolution. 


———s 
— 
on 


. (R.GP.V, Dec. 200 

T E 3.1 lists the possible hazards for the four t Bnd 
e existence of possible haz 

detect the hazard and then to.resolve i nog 


á the incomin instructi i 
being processed in the pipe. Pe re witness Gene instructions 


ne 


Tabl iti 
able 3.1 Hazard Conditions for Various Instruction Types 


Instruction J 
(second) | Store Branch | Conditional Arithmetic and 


Type | T) 
Store type a ype | Branch Type| Load Type 
Branch type WAW 
Conditional branch type 


Arithmetic and load type 


If any of the hazard conditions is detected, a warning signal can be generated to prevent the hazard from occurring. Another approach is to permit the incoming instruction through the pipe and distribute the detection to all the potential pipeline stages. This distributed approach provides better flexibility at the expense of increased hardware control.

When a hazard is detected, the system should resolve the interlock situation. Suppose the instruction sequence {..., I, I + 1, ..., J, J + 1, ...} in which a hazard has been detected between a previous instruction I and the current instruction J. A simple approach is to stop the pipe and to suspend the execution of instructions J, J + 1, ... until the instruction I has passed the point of resource conflict. A modern approach is to suspend only instruction J and proceed with the flow of instructions J + 1, J + 2, ... down the pipe. The potential hazards because of the suspension of J should be continuously checked as instructions J + 1, J + 2, ... move ahead of J. Multilevel hazard detection may be found, needing much more complicated control mechanisms to resolve a stack of hazards.

IBM engineers developed a short-circuiting approach which gives a 
copy of the data object to be written directly to the instruction waiting to 
read the data in order to avoid RAW hazards. This concept was generalized 
into a scheme, called data forwarding, which forwards multiple copies of 
the data to as many waiting instructions as may want to read it. A data 
forwarding chain can be set up in some cases. The internal forwarding and register-tagging techniques are advantageous in resolving logic hazards in pipelines.
Q.25. Explain possible data hazards with its resolving techniques.
(R.G.P.V., June 2014)
Or
Explain in detail the various pipeline hazards and methods to overcome.
(R.G.P.V., June 2017)


Ans. Refer to Q.20 and Q.24. 


Q.26. Describe static scheduling for scheduling instructions through an instruction pipeline.

Ans. The static scheduling is assisted by an optimizing compiler. Data 
dependences in a sequence of instructions pose interlocked relationships among 
them. Using a compiler-based static scheduling method, interlocking can be 
resolved. To enhance the separation between interlocked instructions a compiler 
or a post processor is utilized. 

Consider the execution of the following code fragment. The multiply instruction cannot be started until the preceding load is completed. Because the two loads overlap by one cycle, this data dependence will stall the pipeline for three clock cycles.


Stage delay   Instruction
2 cycles      Add       R0, R1       /R0 ← (R0) + (R1)/
1 cycle       Move      R1, R5       /R1 ← (R5)/
2 cycles      Load      R2, M(α)     /R2 ← (Memory (α))/
2 cycles      Load      R3, M(β)     /R3 ← (Memory (β))/
3 cycles      Multiply  R2, R3       /R2 ← (R2) × (R3)/


The two loads can be moved ahead to increase the spacing between them and the multiply instruction, since they are independent of the add and move. We get the following program after this modification:

Load      R2, M(α)     2 to 3 cycles
Load      R3, M(β)     2 cycles due to overlapping
Add       R0, R1       2 cycles
Move      R1, R5       1 cycle
Multiply  R2, R3       3 cycles


By using the above code rearrangement, the data dependences and program semantics are preserved, and the multiply can be initiated without delay. The add and move consume three cycles, and thus pipeline stalling is avoided while the operands are being loaded from memory cells α and β into registers R2 and R3.
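The effect of moving the loads ahead can be illustrated with a small stall counter. This is a minimal sketch under simplifying assumptions that are not in the text: the pipeline issues at most one instruction per cycle in program order, a result becomes available `latency` cycles after issue, and only read-after-write dependences are modelled. Because of these simplifications it does not reproduce the three-cycle stall quoted above; it only shows that the rearranged order removes the stall. The function name `count_stalls` and the tuple layout are hypothetical.

```python
def count_stalls(program):
    """Count stall cycles for an in-order, single-issue pipeline.

    Each entry of program is (name, latency, source_registers, destination_register).
    """
    ready_at = {}    # register -> cycle at which its new value is available
    next_issue = 0   # earliest cycle at which the next instruction may issue
    stalls = 0
    for name, latency, sources, dest in program:
        operands_ready = max((ready_at.get(r, 0) for r in sources), default=0)
        issue = max(next_issue, operands_ready)
        stalls += issue - next_issue        # cycles lost waiting for operands
        ready_at[dest] = issue + latency
        next_issue = issue + 1
    return stalls


original = [
    ("Add", 2, ["R0", "R1"], "R0"),
    ("Move", 1, ["R5"], "R1"),
    ("Load", 2, [], "R2"),
    ("Load", 2, [], "R3"),
    ("Multiply", 3, ["R2", "R3"], "R2"),
]
# Same instructions with the two independent loads moved ahead.
rearranged = [original[2], original[3], original[0], original[1], original[4]]

print(count_stalls(original), count_stalls(rearranged))
```

A compiler-based scheduler performs this kind of reordering before execution, which is why the technique is called static.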


Q.27. Describe dynamic instruction scheduling for scheduling instructions through an instruction pipeline.

Or

Discuss the difference between Tomasulo’s and scoreboard techniques of dynamic scheduling. (R.G.P.V., June 2014)

Or

What is dynamic instruction scheduling ? (R.G.P.V., Dec. 2016)


Ans. Dynamic scheduling is obtained with Tomasulo’s register-tagging technique built in the IBM 360/91 or with the help of the scoreboarding scheme built in the CDC 6600 processor.

(i) Tomasulo’s Algorithm - The implementation of this hardware dependence-resolution scheme was first done with multiple floating-point units of the IBM 360/91 processor. Fig. 3.20 showed the abstraction of the hardware platform. Three reservation stations (RSs) were used in a floating-point adder and two pairs in a floating-point multiplier for the model 91 processor. Using register tagging to allocate and deallocate the source and destination registers, this scheme resolves resource conflicts as well as data dependences.

An issued instruction whose operands are not present is passed to a reservation station related with the functional unit it will use. It waits there until its data dependences have been resolved and its operands become available. The dependence is resolved by watching the result bus, known as the common data bus in model 91. An instruction is dispatched to the functional unit for execution when all operands for it are present. All working registers are tagged. When a source register is busy and an instruction reaches the issue stage, the tag for the source register is forwarded to an RS. When the register is free, the tag can signal the availability.
Tomasulo’s algorithm was used to work with processors containing a few floating-point registers. Only four registers were available in case of model 91. A minimum-register machine code for computing X = Y + Z and A = B × C is shown in fig. 3.24 (a). The pipeline timing with Tomasulo’s algorithm is shown in fig. 3.24 (b). In this, the total execution time is 13 cycles, counting from cycle 4 to cycle 15 by neglecting the pipeline startup and draining times.

Memory is considered as a special functional unit. After completing the execution of an instruction, the result appears on the result bus. The registers as well as the RSs monitor the result bus and update their contents when a matching tag is found.
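The tag-forwarding idea described above can be sketched in a few lines of Python. This is an illustrative model only, not the Model 91 hardware: the class `TaggedMachine`, its method names, the unit names `LD1`, `LD2`, `ADD1` and the operand values are all assumptions made for the example. A busy register holds the tag of the unit that will produce its value, a reservation station stores either a value or a tag for each operand, and a broadcast on the result bus updates every register and station holding a matching tag.

```python
class TaggedMachine:
    """Toy model of register tagging with reservation stations (hypothetical names)."""

    def __init__(self, registers):
        self.value = {r: 0.0 for r in registers}
        self.tag = {r: None for r in registers}   # None means the value is valid
        self.stations = {}                        # station name -> waiting instruction

    def _operand(self, reg):
        # Forward the tag if the register is busy, otherwise forward its value.
        if self.tag[reg] is not None:
            return ("tag", self.tag[reg])
        return ("val", self.value[reg])

    def issue_load(self, unit, dest):
        # The destination register will be produced by the named load unit.
        self.tag[dest] = unit

    def issue(self, station, op, dest, src1, src2):
        self.stations[station] = {"op": op, "src": [self._operand(src1), self._operand(src2)]}
        self.tag[dest] = station

    def ready(self, station):
        return all(kind == "val" for kind, _ in self.stations[station]["src"])

    def broadcast(self, tag, value):
        # Registers and stations watch the result bus for a matching tag.
        for reg, t in self.tag.items():
            if t == tag:
                self.value[reg] = value
                self.tag[reg] = None
        for entry in self.stations.values():
            entry["src"] = [("val", value) if (k, x) == ("tag", tag) else (k, x)
                            for k, x in entry["src"]]

    def execute(self, station):
        assert self.ready(station), "operands not yet available"
        entry = self.stations.pop(station)
        (_, a), (_, b) = entry["src"]
        result = a + b if entry["op"] == "+" else a * b
        self.broadcast(station, result)
        return result


m = TaggedMachine(["R1", "R2", "R3"])
m.issue_load("LD1", "R1")                  # R1 <- Mem(Y), still in flight
m.issue_load("LD2", "R2")                  # R2 <- Mem(Z), still in flight
m.issue("ADD1", "+", "R3", "R1", "R2")     # waits in its station on tags LD1, LD2
waiting = not m.ready("ADD1")
m.broadcast("LD1", 2.0)
m.broadcast("LD2", 3.0)
total = m.execute("ADD1")
print(waiting, total, m.value["R3"])
```

The add is issued before its operands exist and starts only after both load results have appeared on the bus, which is the behaviour the paragraph above describes.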


R1 ← Mem(Y)
R2 ← Mem(Z)
R3 ← (R1) + (R2)
Mem(X) ← (R3)
R1 ← Mem(B)
R2 ← Mem(C)
R3 ← (R1) * (R2)
Mem(A) ← (R3)

(a) Minimum-Register Machine Code          (b) The Pipeline Schedule

Fig. 3.24 Dynamic Instruction Scheduling using Tomasulo’s Algorithm on the Processor in Fig. 3.20

(ii) CDC Scoreboarding - An early high-performance computer was the CDC 6600 that used dynamic instruction scheduling hardware. Fig. 3.25 (a) shows a CDC 6600-like processor where multiple functional units work as multiple execution pipelines. Parallel units permit instructions to complete out of the original program order. For each execution unit, the processor had instruction buffers. Available functional units get instructions regardless of whether register input data were available. The instruction would then wait in a buffer for its data to be generated by other instructions. The CDC 6600 used a centralized control unit known as the scoreboard to control the correct routing of data between execution units and registers. The scoreboard kept track of the registers required by instructions waiting for the various functional units. The scoreboard enabled the instruction execution when all registers had valid data. Likewise, a functional unit signaled the scoreboard to release the resources when it finished.


Fig. 3.25 (b) depicts the pipeline schedule depending on scoreboard issue logic. The schedule corresponds to the execution of the same machine code for X = Y + Z and A = B × C. The pipeline latencies are similar to those resulting from Tomasulo’s algorithm. The add instruction is issued to its functional unit before its registers are available. Then, it waits for its input register operands.


(a) A CDC 6600-like Processor

Instruction
R1 ← Mem(Y)
R2 ← Mem(Z)
R3 ← (R1) + (R2)
Mem(X) ← (R3)
R4 ← Mem(B)
R5 ← Mem(C)
R6 ← (R4) * (R5)
Mem(A) ← (R6)

(b) The Pipeline Schedule (cycles 1 to 17)

Fig. 3.25 Hardware Scoreboarding for Dynamic Instruction Scheduling


The scoreboard routes the register values to the adder unit when they become free. The issue stage is not blocked in the mean time, so other instructions can bypass the blocked add. Thus, performance is improved in the same way as with static software interlocking. It needs 13 cycles to perform the operations.


Q.28. Explain Tomasulo’s algorithm for dynamic instruction scheduling.
(R.G.P.V., Dec. 2010)
Or
Write Tomasulo’s algorithm. (R.G.P.V., June 2012, Dec. 2014)
Or
Explain Tomasulo’s algorithm. (R.G.P.V., June 2016)
Or
Briefly explain how to overcome data hazards with dynamic scheduling using Tomasulo’s approach. (R.G.P.V., June 2015)
Or
Write short note on Tomasulo’s algorithm. (R.G.P.V., May 2018)


Ans. Refer to Q.27. 
Q.29. Explain branch handling in pipeline systems. (R.G.P.V., June 2006)
Or
Explain branch handling techniques in context of instruction pipeline design. (R.G.P.V., Dec. 2007)
Or
Explain branch handling techniques in pipelining. (R.G.P.V., June 2010)
Or
What is branch handling ? Also write its advantages. (R.G.P.V., June 2012)
Or
Describe about branch handling techniques. (R.G.P.V., Dec. 2014)
Or
Explain branch handling techniques. (R.G.P.V., June 2015)

Ans. A branch can be predicted either statically, depending on branch code types, or dynamically, depending on branch history during program execution. For predicting a branch statically, the probability of branch with respect to a particular branch instruction type is utilized. This needs collecting the frequency and probabilities of branch types and branch taken across a great number of program traces. This type of static branch strategy may not be always precise.

Usually, the static prediction direction is wired into the processor. The best performance is given by predicting taken according to past experience. This comes from the fact that most conditional branch instructions are taken in program execution. Once committed to the hardware, the wired-in static prediction cannot be changed. However, the scheme can be modified to permit the programmer or compiler to choose the direction of each branch on a semi-static prediction basis.

The recent branch history is used by a dynamic branch strategy to predict 
whether or not the branch will be taken next time when it takes place. To be 
precise, one may require to use the whole history of the branch to predict the 
future choice. This is not feasible to implement. Therefore, limited recent 
history is used for most dynamic prediction, as shown in fig. 3.26. 

The dynamic branch strategies have been classified by Cragon into three major classes. The first class predicts the branch direction on the basis of information found at the decode stage. The second class uses a cache to store target addresses at the stage the effective address of the branch target is calculated. The third class uses a cache to store target instructions at the fetch stage. All dynamic predictions are adjusted dynamically as a program is executed.

Dynamic prediction needs additional hardware to keep track of the past behaviour of the branch instructions at run time. Only a small amount of history should be recorded. Otherwise, the implementation of prediction becomes too costly.
(a) Branch Target Buffer Organization (fields: branch instruction address, branch prediction statistics, branch target address)
(b) State Diagram (states: last two branches taken; last branch taken and previous not taken; last branch not taken and previous branch taken; both last two branches not taken)

Fig. 3.26
The use of a branch target buffer (BTB) to implement branch prediction (fig. 3.26 (a)) has been shown by Lee and Smith. The recent branch information, including the address of the branch target used, is stored by the BTB. The address of the branch instruction is used to locate its entry in the BTB.

For example, Lee and Smith have given a state transition diagram (fig. 3.26 (b)) for backtracking the last two branches in a given program. The BTB entry has the backtracking information which will help in the prediction. On completion of the current branch, prediction information should be updated.

The BTB can be extended to store the branch target address and the target instruction itself and a few of its successor instructions, in order to permit zero delay in converting conditional branches to unconditional branches. The taken (T) and not-taken (N) labels in the state diagram correspond to actual program behaviour. Distinct programs may use distinct state diagrams which are modified dynamically based on historical program events.
Q.30. What is the use of branch target buffer ? : 
(R.GPV., June 2015, Dec. 2015) 


Ans. Refer to Q.18 and Q.29. 


Q.31. Explain the effect of branching on pipeline performance. 


Ans. The branching effect on pipeline performance is discussed below. Consider an instruction pipeline composed of five segments: instruction fetch, decode, operand fetch, execute and store results. Possible memory conflicts between overlapped fetches are neglected and a sufficiently large cache memory is assumed. Fig. 3.27 shows the effect of branching on the performance of an instruction pipeline. A stream of instructions is executed continuously by the pipeline in an overlapped fashion if branch-type instructions do not appear. In such situations, when the pipeline is filled up with sequential instructions, the pipeline completes the execution of one instruction per fixed latency.


(a) Five-segment instruction pipeline (S1 to S5)
(b) Overlapped Execution of Instructions without Branching
(c) Execution when One of the Instructions is a Branch Instruction

Fig. 3.27 Branching Effect on the Performance of an Instruction Pipeline

In contrast, a branch instruction entering the pipeline may be halfway down the pipe before a branch decision is made. This causes the program counter to be loaded with the new address to which the program should be directed, making all the prefetched instructions useless. The next instruction cannot be initiated until the completion of the current branch-instruction cycle. This results in extra time delays in order to drain the pipeline. The overlapped action is suspended. Also, at the end of the branch cycle, the pipeline should be drained. Due to the presence of a branch instruction, the continuous flow of instructions into the pipeline is thus temporarily interrupted. The greater the percentage of branch-type instructions in a program, the slower the program will run on a pipeline processor. This defeats the purpose of pipelining.
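The slowdown can be estimated with a simple first-order model. The sketch below is an assumption introduced for illustration and is not derived in the text: n instructions flow through a k-segment pipeline, a fraction p of them are branches, a fraction q of those branches are taken, and every taken branch costs b extra cycles to drain and refill the pipe. All symbols and function names are hypothetical.

```python
def pipeline_cycles(n, k, p, q, b):
    """Estimated cycles for n instructions on a k-segment pipeline.

    p: fraction of instructions that are branches
    q: fraction of branches that are taken
    b: penalty in cycles for each taken branch (assumed constant)
    """
    ideal = k + n - 1                 # fill the pipe, then one result per cycle
    return ideal + p * q * n * b      # add the drain and refill cost of taken branches


def throughput(n, k, p, q, b):
    """Instructions completed per cycle under the same model."""
    return n / pipeline_cycles(n, k, p, q, b)


no_branches = pipeline_cycles(100, 5, 0.0, 0.0, 4)
with_branches = pipeline_cycles(100, 5, 0.2, 0.6, 4)
print(no_branches, with_branches)
```

Under this model the extra cycles grow linearly with the branch fraction p, which matches the qualitative statement that a program with more branch-type instructions runs more slowly on a pipeline processor.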


Q.32. What are the different ways for branch prediction ? Discuss how pipeline performance issues can be reduced by branch prediction.
(R.G.P.V., Dec. 2017)

Ans. Different Ways for Branch Prediction - Refer to Q.29.
Pipeline Performance Issues can be Reduced by Branch Prediction -


Q.33. Explain the concept of delayed branching in instruction pipelines.

Ans. Examining the branch penalty, we realize that if the delay slot could be shortened or minimized to a zero penalty, the branch penalty would be decreased significantly. The aim of delayed branches is to make this possible, as depicted in fig. 3.28.
(a) A Delayed Branch for 2 Cycles when the Branch Condition is Resolved at the Decode Stage
(b) A Delayed Branch for 3 Cycles when the Branch Condition is Resolved at the Execute Stage
(c) A Delayed Branch for 4 Cycles when the Branch Condition is Resolved at the Store Stage

Fig. 3.28 The Concept of Delayed Branch by Moving Independent Instructions or NOP Fillers into the Delay Slot of a Four-stage Pipeline

Originally, the idea was used to decrease the branching penalty in coding microinstructions. A delayed branch of d cycles permits at most d − 1 useful instructions to be executed following the branch taken. The execution of these instructions should not depend on the result of the branch instruction. Otherwise, a zero branching penalty cannot be obtained.

The technique is the same as that used for software interlocking. NOPs can be used as fillers if required. According to some program trace results, the probability of moving one instruction (d = 2 in fig. 3.28 (a)) into the delay slot is higher than 0.6, that of moving two instructions (d = 3 in fig. 3.28 (b)) is about 0.2, and that of moving three instructions (d = 4 in fig. 3.28 (c)) is smaller than 0.1.

Q.34. What do you understand by delayed branch approach of jump instruction in the instruction pipeline ? Discuss with suitable examples.
(R.G.P.V., Dec. 2016)

Ans. Delayed Branch - Delayed branch is a software procedure used by most RISC processors. In this procedure, the compiler finds the branch instructions and rearranges the machine language code sequence by inserting useful instructions that keep the pipeline operating without interruptions. The insertion of a no-operation instruction after a branch instruction is an example of delayed branch. This causes the computer to fetch the target instruction during the execution of the no-operation instruction, permitting a continuous flow of the pipeline. Let us take an example.


[Figure: Fetch, Decode and Execute stages plotted against clock cycles for the instructions FEXT, BITCLR and JUMP, with the pipeline stalling cycles marked.]

Fig. 3.29 Standard Branch Pipeline Operation

Fig. 3.29 shows standard branch pipeline operation in which the JUMP instruction is followed by two no-operation (NOP) instructions. In standard branch execution, when the JUMP instruction is fetched from memory, the processor cannot predict the next statement to be executed after JUMP until the JUMP statement is executed, so it inserts NO operations for the next instructions. This results in pipeline stalling, i.e., wastage of clock cycles (in fig. 3.29 two clock cycles are wasted).

To overcome this situation, the delayed branch approach is used. In this approach, those instructions which are not affected by the JUMP execute operation are fetched or decoded during the branch operation of the JUMP statement, as shown in fig. 3.30.

In this approach, insertion of NO operation (NOP) is not required, which reduces the overall execution time and the number of clock cycles.

[Figure: Fetch and Decode stages plotted against clock cycles 1 to 5 for the instructions JUMP, FEXT, BITCLR and MUL.]

Fig. 3.30 Delayed Branch Pipeline Operation


ARITHMETIC PIPELINE DESIGN, STATIC ARITHMETIC
PIPELINE, MULTIFUNCTIONAL ARITHMETIC PIPELINES,
SUPERSCALAR PIPELINE DESIGN, SUPERPIPELINE
PROCESSOR DESIGN

Q.35. Explain how arithmetic pipelines are designed using pipelining.
Or
Explain arithmetic pipeline design for fixed point operations and floating point operations. (R.G.P.V., June 2010)
Or
Explain static arithmetic pipelines. (R.G.P.V., Dec. 2010)
Or
Explain the working of arithmetic pipeline with suitable example.
(R.G.P.V., June 2011)

Ans. Arithmetic Pipeline Stages - The different pipeline stages in an arithmetic unit need different hardware logic based on the function to be implemented. Because all arithmetic operations, like add, subtract, multiply, divide, squaring, square rooting, logarithm, etc., can be implemented with the basic add and shifting operations, the core arithmetic stages need some form of hardware to add or to shift. Using shift registers, arithmetic or logical shifts are easily carried out. High speed addition needs either the use of a carry-save adder (CSA), which adds three input numbers and generates one sum output and a carry output as shown in fig. 3.31 (b), or the use of a carry-propagation adder (CPA), which adds two numbers and generates an arithmetic sum as shown in fig. 3.31 (a).

(a) An n-bit Carry-propagate Adder (CPA)
(b) An n-bit Carry-save Adder (CSA)

Fig. 3.31 Distinction between a Carry-propagate Adder (CPA) and a Carry-save Adder (CSA)

The carries produced in successive digits are permitted to propagate from the low end to the high end in a CPA, using either ripple carry propagation or some carry lookahead technique.

The carries are not permitted to propagate in a CSA but are saved in a carry vector. Generally, an n-bit CSA is defined as follows: Let X, Y and Z be three n-bit input numbers, represented as $X = (x_{n-1}, x_{n-2}, \ldots, x_1, x_0)$ and so on. To generate two n-bit output numbers, the CSA carries out bitwise operations simultaneously on all columns of digits. The output numbers are represented as $S^b = (0, S_{n-1}, S_{n-2}, \ldots, S_1, S_0)$ and $C = (C_n, C_{n-1}, \ldots, C_1, 0)$.

It is noted that the tail bit of the carry vector C is always a 0 and the leading bit of the bitwise sum $S^b$ is always a 0. The input-output relationships are given below:

$$S_i = x_i \oplus y_i \oplus z_i$$
$$C_{i+1} = x_i y_i \lor y_i z_i \lor z_i x_i$$ ...(i)

for i = 0, 1, 2, ..., n − 1, where $\oplus$ denotes the exclusive OR and $\lor$ denotes the logical OR operation. It is noted that the arithmetic sum of the three input numbers, i.e., S = X + Y + Z, is achieved by adding the two output numbers, i.e., $S = S^b + C$, with the help of a CPA. The CPA and CSAs are used to implement the pipeline stages of a fixed-point multiply unit.
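The carry-save relations can be checked with a few lines of Python. This is a minimal sketch that treats the n-bit numbers as non-negative integers and applies the bitwise relations to all columns at once; the function name `carry_save_add` and the sample operands are illustrative choices, not taken from the text.

```python
def carry_save_add(x, y, z, n):
    """Return (sb, c): the bitwise sum and the carry vector of an n-bit CSA."""
    mask = (1 << n) - 1
    x, y, z = x & mask, y & mask, z & mask
    sb = x ^ y ^ z                                # S_i = x_i xor y_i xor z_i
    c = ((x & y) | (y & z) | (z & x)) << 1        # C_(i+1) = majority of column i
    return sb, c


# Three illustrative 6-bit operands.
X, Y, Z = 0b001011, 0b010101, 0b111101
sb, c = carry_save_add(X, Y, Z, 6)
print(bin(sb), bin(c), sb + c == X + Y + Z)
```

No carry travels between columns inside the CSA; the only carry propagation happens in the final addition `sb + c`, which plays the role of the CPA.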
Multiply Pipeline Design - Consider the multiplication of two 8-bit integers X × Y = Z, where Z is the 16-bit product in double precision. This fixed-point multiplication is written as the summation of eight partial products as: $Z = X \times Y = Z_0 + Z_1 + Z_2 + \cdots + Z_7$, where × and + denote arithmetic multiply and add operations, respectively.

It is noted that the partial product $Z_j$ is achieved by multiplying the multiplicand X by the jth bit of Y and then shifting the result j bits to the left for j = 0, 1, 2, ..., 7. Thus $Z_j$ is (8 + j) bits long with j trailing zeros. The summation of the eight partial products is performed with a Wallace tree of CSAs plus a CPA at the final stage, as depicted in fig. 3.32.

All eight partial products are produced by the first stage ($S_1$), ranging from 8 bits to 15 bits, simultaneously. The second stage ($S_2$) is composed of two levels of four CSAs, and combines eight numbers into four numbers ranging from 13 to 15 bits. The third stage ($S_3$) is composed of two CSAs, and combines four numbers into two 16-bit numbers. The final stage ($S_4$) is a CPA, which adds up the last two numbers to generate the final product Z.

The CPA is estimated to require four gate levels of delay for a maximum width of 16 bits. Each level of the CSA is implemented with a two-gate-level logic. The delay of the first stage ($S_1$) also includes two gate levels. There is an approximately equal amount of delay in all the pipeline stages. The matching of stage delays is critical to the determination of the number of pipeline stages, as well as the clock period. If the delay of the CPA stage can be further reduced to match that of a single CSA level, then the pipeline can be split into six stages with a clock rate twice as fast.
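The four stages can be traced in software. The Python sketch below is an illustrative model for unsigned 8-bit operands, not a gate-level description: it forms the eight shifted partial products, reduces them with carry-save additions until two numbers remain, and finishes with one ordinary addition standing in for the CPA. The function names are hypothetical, and the grouping of CSA levels into stages is only noted in comments.

```python
def csa(x, y, z):
    """Carry-save add: three numbers in, bitwise sum and carry vector out."""
    return x ^ y ^ z, ((x & y) | (y & z) | (z & x)) << 1


def multiply_8bit(x, y):
    """Multiply two unsigned 8-bit integers with a CSA tree and a final CPA."""
    # Stage S1: eight partial products, the j-th one shifted j bits to the left.
    numbers = [(x << j) if (y >> j) & 1 else 0 for j in range(8)]
    # Stages S2 and S3: each pass is one CSA level (8 -> 6 -> 4 -> 3 -> 2 numbers).
    while len(numbers) > 2:
        reduced = []
        while len(numbers) >= 3:
            a, b, c = numbers[:3]
            numbers = numbers[3:]
            reduced.extend(csa(a, b, c))
        numbers = reduced + numbers   # leftovers pass through to the next level
    # Stage S4: the carry-propagate adder combines the last two numbers.
    return numbers[0] + numbers[1]


print(multiply_8bit(13, 11), multiply_8bit(255, 255))
```

The reduction uses two CSAs in each of the first two levels and one CSA in each of the last two, which agrees with the four CSAs of stage $S_2$ and the two CSAs of stage $S_3$ described above.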
Fig. 3.32 A Pipeline Unit for Fixed-point Multiplication of 8-bit Integers

Convergence Division - Division can be performed by repeated multiplications. Mantissa division is performed by a convergence method. This convergence division achieves the quotient Q = M/D of two normalized fractions $0.5 \le M < D < 1$ in two’s complement notation by carrying out two sequences of chain multiplications as follows:

$$Q = \frac{M \times R_1 \times R_2 \times \cdots \times R_k}{D \times R_1 \times R_2 \times \cdots \times R_k}$$ ...(ii)

where the successive multipliers

$$R_i = 1 + \delta^{2^{i-1}} = 2 - D^{(i)} \quad \text{for } i = 1, 2, \ldots, k \text{ and } D = 1 - \delta$$

The objective is to select $R_i$ such that the denominator $D \times R_1 \times R_2 \times \cdots \times R_k \to 1$ for a sufficient number of k iterations, and then the resulting numerator $M \times R_1 \times R_2 \times \cdots \times R_k \to Q$.

It is noted that the multiplier $R_i$ can be achieved by finding the two’s complement of the previous chain product $D^{(i)} = D \times R_1 \times \cdots \times R_{i-1} = 1 - \delta^{2^{i-1}}$, because $2 - D^{(i)} = R_i$. The reason why the denominator tends to 1 for large k is that

$$D^{(i+1)} = D \times R_1 \times \cdots \times R_i = (1-\delta)(1+\delta)(1+\delta^2)(1+\delta^4)\cdots(1+\delta^{2^{i-1}})$$
$$= (1-\delta^2)(1+\delta^2)(1+\delta^4)\cdots(1+\delta^{2^{i-1}})$$
$$= 1 - \delta^{2^i} \quad \text{for } i = 1, 2, \ldots, k$$ ...(iii)

Since $0 < \delta = 1 - D < 0.5$, $\delta^{2^i} \to 0$ when i is sufficiently large, say, i = k for some k. Therefore $D^{(k+1)} = 1 - \delta^{2^k} \approx 1$ for large k. The end result is

$$Q = M \times (1+\delta) \times (1+\delta^2) \times \cdots \times (1+\delta^{2^{k-1}})$$ ...(iv)

The above two sequences of chain multiplications are performed alternately between the numerator and denominator through the pipeline stages. To summarize, division is performed by repeated multiplications. Therefore, the same hardware pipeline can be shared by divide and multiply.
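The convergence method is easy to try numerically. The following Python sketch uses ordinary floating-point numbers in place of the fixed-point mantissas and forms each multiplier as 2 minus the current denominator, which plays the role of the two's complement step; the function name `convergence_divide` and the default of six iterations are assumptions made for the example.

```python
def convergence_divide(m, d, k=6):
    """Approximate m / d for 0.5 <= m < d < 1 by k rounds of chain multiplication."""
    assert 0.5 <= m < d < 1, "operands must be normalized fractions"
    numerator, denominator = m, d
    for _ in range(k):
        r = 2.0 - denominator      # multiplier R_i from the current chain product
        numerator *= r             # numerator chain tends to the quotient Q
        denominator *= r           # denominator chain tends to 1
    return numerator


q = convergence_divide(0.6, 0.8)
print(q)
```

Because the error term is squared in every round, a handful of iterations is enough at double precision, and each round needs only the multiplier hardware.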
Q.36. Discuss different pipeline designs for processors.
(R.G.P.V., June 2012, 24)

Ans. Refer to Q.14 and Q.35.
Q.37. Distinguish between the following -
(i) Arithmetic and instruction pipeline
(ii) Unifunctional and multifunctional pipeline
(iii) Static and dynamic pipeline
(iv) Scalar and vector pipeline.
(R.G.P.V., June 2011)

Ans. (i) Arithmetic and Instruction Pipeline - Refer to Q.35 and Q4


(ii) Unifunctional and Multifunctional Pipeline - Static arithmetic pipelines are designed to perform a fixed function and are thus called unifunctional pipelines. When a pipeline can perform more than one function, it is called multifunctional. A multifunctional pipeline can be either static or dynamic.

(iii) Static and Dynamic Pipeline - Static pipelines perform one function at a time, but different functions can be performed at different times. In a static pipeline, it is easy to partition a given function into a sequence of linearly ordered subfunctions. A dynamic pipeline allows several functions to be performed simultaneously through the pipeline, as long as there are no conflicts in the shared usage of pipeline stages. Function partitioning in a dynamic pipeline becomes quite involved because the pipeline stages are interconnected with loops in addition to streamline connections.


(iv) Scalar and Vector Pipeline - Scalar and vector arithmetic 
pipelines differ mainly in the areas of register files and control mechanisms 
involved. Vector hardware pipelines are often built as add-on options to a 
scalar processor or as an attached processor driven by a control processor. 
Both scalar and vector processors are used in modern supercomputers. 


Q.38. What is multifunctional arithmetic pipeline ? Discuss a static multifunctional pipeline designed into the TI advanced scientific computer (ASC).
Or
Explain multifunctional arithmetic pipelines.
(R.G.P.V., June 2013, 2016)
Or
Write a short note on multifunction arithmetic pipelines.
(R.G.P.V., June 2017)

Ans. Multifunctional Arithmetic Pipeline - When more than one function can be performed by a pipeline, it is known as multifunctional. A multifunctional pipeline can be of two types, static and dynamic. Static pipelines perform one function at a time, but different functions can be performed at different times. A dynamic pipeline can perform several functions simultaneously through the pipeline, as long as there are no conflicts in the shared usage of pipeline stages. A static multifunctional pipeline designed into the TI advanced scientific computer (ASC) is discussed below.

The TI-ASC Arithmetic Processor Design - Fig. 3.33 shows that four pipeline arithmetic units are built into the TI-ASC system. The fetching and decoding of instructions is handled by the instruction-processing unit. There are a large number of working registers in the processor. These registers control the operations of the memory buffer unit and of the arithmetic units. There are two sets of operand buffers, {X, Y, Z} and {X', Y', Z'}, in each arithmetic unit. X', X, Y' and Y are used for input operands. Z' and Z are used to output results.


[Figure: block diagram showing the Instruction Processing Unit (IPU) with its instruction buffer and index, vector parameter and arithmetic registers, the main memory, the memory buffer unit with its control, and the Pipeline Arithmetic Units (PAU).]

Fig. 3.33 The Architecture of the TI Advanced Scientific Computer


There are eight stages in each pipeline arithmetic unit as shown in fig. 3.34 (a). The pipeline arithmetic unit (PAU) is a static multifunction pipeline. Fig. 3.34 (a) shows all the possible interstage connections for performing arithmetic, logical, shifting and data conversion functions. This pipeline can perform both fixed and floating-point arithmetic functions. The PAU also supports vector besides scalar arithmetic operations. Note that various functions need different interstage connection patterns and different pipeline stages.
Fixed-point multiplication, for example, needs the use of only segments $S_1$, $S_6$, $S_7$ and $S_8$, as depicted in fig. 3.34 (b). In contrast, the floating-point dot product function, which carries out the dot product operation between two vectors, needs the use of all segments with the complicated connections depicted in fig. 3.34 (c). This dot product was implemented by the following accumulated summation of a sequence of multiplications through the pipeline:

$$Z \leftarrow A_i \times B_i + Z$$
(a) Pipeline Stages and Interconnections
(b) Fixed-point Multiplication
(c) Floating-point Dot Product

Fig. 3.34 The Multifunction Arithmetic Pipeline of the TI Advanced Scientific Computer and the Interstage Connections of two Representative Functions

Here, the successive operands $(A_i, B_i)$ are given to the X and Y-buffers, and the accumulated sums are given to the Z-buffer recursively. The whole pipeline can perform the multiply and the add in a single flow through the pipeline. The loading and fetching of operands to or from the PAU will be isolated by the two levels of buffer registers.

Q.39. List the various pipeline design parameters for superscalar and superpipeline processors.

Ans. Table 3.2 lists some parameters used in designing the scalar base machine and superscalar machines for four types of pipeline processors. It is assumed that all pipelines have k stages. The machine pipeline cycle for the scalar base machine is assumed to be 1 time unit, known as the base cycle. The maximum number of instructions that can be simultaneously executed in the pipeline is the instruction-level parallelism (ILP). All of these parameters have a value of 1 for the base machine. All machine types are designed relative to the base machine. The ILP is required to fully use a given pipeline machine.


Table 3.2 Pipeline Processors Design Parameters

Machine Type | Scalar Base Machine of k Pipeline Stages | Superscalar Machine of Degree m | Superpipelined Machine of Degree n | Superpipelined Superscalar Machine of Degree (m, n)
Machine pipeline cycle | 1 (base cycle) | 1 | 1/n | 1/n
Instruction issue rate | 1 | m | 1 | m
Instruction issue latency | 1 | 1 | 1/n | 1/n
Simple operation latency | 1 | 1 | 1 | 1
ILP to fully utilize the pipeline | 1 | m | n | mn
Q.40. Describe the structure and performance of superscalar processor.
Or
What is superscalar pipelined processors ? What factors limit the degree of superscalar design ? (R.G.P.V., Dec. 2010)
Or
Draw superscalar pipeline design. (R.G.P.V., June 2012)

Ans. The instruction decoding and execution resources are increased to form essentially m pipelines operating concurrently in an m-issue superscalar processor. The functional units may be shared by multiple pipelines at some pipeline stages. A design example in fig. 3.35 shows this resource-shared multiple-pipeline structure. In this design, if there is no resource conflict and no data dependence problem then the processor can issue two instructions per cycle. In the design, there are essentially two pipelines. Both pipelines contain four processing stages, namely fetch, decode, execute and store respectively. Each pipeline contains its own unit for fetch, decode and store. The two instruction streams passing through the two pipelines come from a single source stream. The fan-out from a single instruction stream is subject to the data dependence relationship and resource constraints among the successive instructions.

[Figure: two pipelines fed from the I-cache, each with its own fetch and decode units, sharing the execute stage (adder, multiplier, logic and load units, with operands from the D-cache) and two store (writeback) units.]

Fig. 3.35 A Dual-pipeline, Superscalar Processor


It is assumed for simplicity that each pipeline stage needs one cycle,
except the execute stage, which may need a variable number of cycles. The
execute stage uses four different functional units: adder, multiplier, load
unit and logic unit. The adder has two pipeline stages, the multiplier has
three pipeline stages and the others each contain only one stage. S1 and S2
are the two store units that can be dynamically utilized by the two
pipelines, based on availability at a particular cycle. A lookahead window
exists with its own fetch and decoding logic. In case out-of-order
instruction issue is required to obtain better pipeline throughput, this
window is used for instruction lookahead.

Superscalar Performance - We estimate the ideal execution time of N
independent instructions through the pipeline to compare the relative
performance between a superscalar machine and a scalar base machine. The
time needed by the scalar base machine is

$$T(1, 1) = k + N - 1 \quad \text{(base cycles)}$$

The ideal execution time needed by an m-issue superscalar machine is

$$T(m, 1) = k + \frac{N - m}{m} \quad \text{(base cycles)}$$

where k is the time needed to execute the first m instructions using the m
pipelines simultaneously and the second term corresponds to the time needed
to execute the remaining N - m instructions, m per cycle, using m pipelines.

The ideal speedup of the superscalar machine over the base machine is

$$S(m, 1) = \frac{T(1, 1)}{T(m, 1)} = \frac{N + k - 1}{(N/m) + k - 1} = \frac{m(N + k - 1)}{N + m(k - 1)}$$

As $N \to \infty$, the speedup limit $S(m, 1) \to m$, as expected.
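The superscalar timing formulas can be checked numerically. The following is a minimal sketch, not part of the original answer: the function names and the sample values k = 4 and m = 2 are assumptions chosen only for illustration. It evaluates T(1, 1), T(m, 1) and S(m, 1) exactly and shows the speedup approaching m as N grows.

```python
from fractions import Fraction


def base_time(n_instr, k):
    """T(1, 1) = k + N - 1 base cycles for the scalar base machine."""
    return Fraction(k + n_instr - 1)


def superscalar_time(n_instr, k, m):
    """T(m, 1) = k + (N - m)/m base cycles for an m-issue machine."""
    return k + Fraction(n_instr - m, m)


def superscalar_speedup(n_instr, k, m):
    """S(m, 1) = T(1, 1) / T(m, 1)."""
    return base_time(n_instr, k) / superscalar_time(n_instr, k, m)


if __name__ == "__main__":
    # Illustrative values only: a 4-stage, dual-issue machine.
    for n_instr in (4, 64, 1024, 10**6):
        s = superscalar_speedup(n_instr, k=4, m=2)
        print(f"N = {n_instr:>7}: S(2, 1) = {float(s):.4f}")
```

For short instruction sequences the fill time k dominates and the speedup stays well below m; only for large N does it come close to the issue width.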


Q.41. Explain data path and its control in detail. (R.G.P.V., Dec. 2017)


Ans. In fig. 3.36, all the registers and the arithmetic and logic unit
(ALU) are interconnected through a single common bus which is internal to
the processor. The external bus, which connects the processor to the
memory and input/output devices, is also illustrated in fig. 3.36. The
address and data lines of the external memory bus are connected to the
internal processor bus through the memory address register (MAR) and the
memory data register (MDR) respectively. MDR has two inputs and two
outputs. Data can be loaded into MDR either from the internal processor
bus or from the memory bus. The data stored in MDR can be placed on either
the memory bus or the internal bus. The input of MAR is connected to the
internal bus and its output is connected to the external bus. The control
lines of the memory bus are connected to the instruction decoder and
control logic block. This block is responsible for issuing the signals that
control the operation of all the units inside the processor and for
interacting with the memory bus. The number and use of the processor
registers R0 through R(n - 1) vary from one processor to another. Some
registers may be provided for general-purpose use by the programmer. Some
can be dedicated as special-purpose registers like stack pointer or index
register. Three registers Y, Z and TEMP are transparent to the programmer,
which means the programmer need not be concerned with them since they are
never referenced explicitly by any instruction. They are used by the
processor for temporary storage during execution of some instructions.
These registers are not used for storing data which is produced by one
instruction for future use by another instruction. The multiplexer (MUX)
selects either a constant value 4 or the output of register Y to be given
as input A of the ALU. The constant 4 is used to increment the contents of
the program counter. The two possible values of the MUX control input
Select are referred to as Select4 and SelectY, for selecting the constant
4 or register Y respectively. As instruction execution progresses, data
are sent from one register to another, often passing via the ALU to
perform some arithmetic or logic operation. The instruction decoder and
control logic unit is responsible for implementing the actions defined by
the instruction loaded in the IR register. The control signals are produced
by the decoder. These control signals are required to choose the registers
involved and direct the transfer of data. The ALU, registers and the
interconnecting bus are collectively known as the datapath.

Fig. 3.36 Datapath


To make it suitable for pipelined execution, the three-bus structure is
modified as represented in fig. 3.37 to support a four-stage pipeline. The
resources used in stages W and D are represented by solid lines and those
used in stages E and F are represented by dashed lines. Operations in the
data cache may take place during stage E or at a later stage, depending on
the implementation and the addressing mode. This part is also represented
by dashed lines. Several modifications are given below -


Fig. 3.37 Datapath Modified for Pipelined Execution with Interstage
Buffers at the Input and Output of the ALU

(i) There are separate instruction and data caches, which use separate
address and data connections to the processor. This needs two versions of
the MAR register: IMAR for accessing the instruction cache and DMAR for
accessing the data cache.

(ii) The PC is directly connected to the IMAR. Hence, the contents of the
PC can be sent to IMAR while an independent ALU operation is being
performed.

(iii) The data address in DMAR can be obtained directly from the register
file or from the ALU to support the register indirect and indexed
addressing modes.

(iv) Separate memory data registers are given for read and write
operations. Data can be transferred directly between these registers and
the register file during load and store operations without the need to
pass via the ALU.

(v) At the input and output of the ALU, buffer registers SRC1, SRC2 and
RSLT have been introduced. Forwarding connections are not included in the
figure; they can be added if desired.

(vi) The instruction register has been replaced by an instruction queue.
The instruction queue is loaded from the instruction cache.

(vii) The instruction decoder output is connected to the control signal
pipeline. This pipeline uses the buffers B2 and B3 to store the control
signals.

The operations that can be performed independently in the processor are
given below -

(i) Reading an instruction from the instruction cache.
(ii) Incrementing the PC.
(iii) Decoding an instruction.
(iv) Writing into or reading from the data cache.
(v) Reading the contents of up to two registers from the register file.
(vi) Writing into one register in the register file.
(vii) Performing an ALU operation.

These operations can be performed simultaneously in any combination
because they do not use any shared resources. This structure gives the
flexibility needed to implement the four-stage pipeline. For example, let
I1, I2, I3 and I4 be a sequence of four instructions. The actions that all
take place during clock cycle 4 are given below -

(i) Write the result of instruction I1 into the register file.
(ii) Read the operands of instruction I2 from the register file.
(iii) Decode instruction I3.
(iv) Fetch instruction I4.
(v) Increment the PC.

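The cycle-4 example above can be reproduced with a small occupancy model. This is only a sketch under stated assumptions: the action names and their order are taken from the list above, the function names are hypothetical, and it is assumed that I1 is fetched in cycle 1 and that one new instruction enters the pipeline every cycle with no stalls.

```python
# Actions in pipeline order, named after the cycle-4 example in the text.
ACTIONS = ["fetch", "decode", "read operands", "write result"]


def action_of(instr_index, cycle):
    """Return what instruction I<instr_index> does in the given clock cycle,
    or None if it is not in the pipeline during that cycle."""
    position = cycle - instr_index  # I1 is fetched in cycle 1
    if 0 <= position < len(ACTIONS):
        return ACTIONS[position]
    return None


def snapshot(cycle, n_instr):
    """Map each instruction active in `cycle` to its action."""
    result = {}
    for i in range(1, n_instr + 1):
        action = action_of(i, cycle)
        if action is not None:
            result[f"I{i}"] = action
    return result


if __name__ == "__main__":
    for cycle in range(1, 8):
        print(cycle, snapshot(cycle, n_instr=4))
```

In cycle 4 all four positions are occupied at once, which is possible only because the listed operations use no shared resources.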

Q.42. Write short note on data dependence problem.


Ans. For the program in fig. 3.38, the relationships among the
instructions are indicated by drawing a dependence graph. We have flow
dependence I1 → I2 because the content of register R1 is loaded by I1 and
then used by I2. Because the result in register R4 after executing I4 may
affect the operand register R4 used by I3, we have antidependence
I3 → I4. Because both I5 and I6 modify the register R6, and R6 provides an
operand for I6, we have both flow dependence and output dependence
I5 → I6, as shown in the dependence graph. These data dependences must not
be violated when scheduling instructions through one or more pipelines.
Otherwise, erroneous results may be generated. These data dependences are
detected by a compiler and made available at pipeline scheduling time.


I1.  Load  R1, A      /R1 ← Memory (A)/
I2.  Add   R2, R1     /R2 ← (R2) + (R1)/
I3.  Add   R3, R4     /R3 ← (R3) + (R4)/
I4.  Mul   R4, R5     /R4 ← (R4) * (R5)/
I5.  Comp  R6         /R6 ← complement of (R6)/
I6.  Mul   R6, R7     /R6 ← (R6) * (R7)/

Dependence graph: I1 → I2 (flow dependence), I3 → I4 (antidependence),
I5 → I6 (output dependence, also flow dependence).


Fig. 3.38 A Sample Program for Parallel Execution
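The dependences in fig. 3.38 can also be found mechanically from the registers each instruction reads and writes. The sketch below is an illustration, not the compiler algorithm referred to in the text: the read and write sets are transcribed by hand from the listing, the names `PROGRAM` and `dependences` are hypothetical, and the check is a simple pairwise comparison that ignores intervening redefinitions of a register (which does not matter for this program).

```python
from itertools import combinations

# (name, registers read, registers written), transcribed from fig. 3.38.
PROGRAM = [
    ("I1", set(), {"R1"}),          # Load R1, A
    ("I2", {"R2", "R1"}, {"R2"}),   # Add  R2, R1
    ("I3", {"R3", "R4"}, {"R3"}),   # Add  R3, R4
    ("I4", {"R4", "R5"}, {"R4"}),   # Mul  R4, R5
    ("I5", {"R6"}, {"R6"}),         # Comp R6
    ("I6", {"R6", "R7"}, {"R6"}),   # Mul  R6, R7
]


def dependences(program):
    """Return (earlier, later, kind) triples for every instruction pair."""
    found = []
    # combinations() keeps program order, so `a` always precedes `b`.
    for (a, reads_a, writes_a), (b, reads_b, writes_b) in combinations(program, 2):
        if writes_a & reads_b:
            found.append((a, b, "flow"))    # b reads what a wrote
        if reads_a & writes_b:
            found.append((a, b, "anti"))    # b overwrites what a reads
        if writes_a & writes_b:
            found.append((a, b, "output"))  # both write the same register
    return found


if __name__ == "__main__":
    for earlier, later, kind in dependences(PROGRAM):
        print(f"{earlier} -> {later}: {kind} dependence")
```

Besides the three relations named in the answer, this pairwise check also reports an antidependence from I5 to I6, since I5 reads R6 before I6 overwrites it; the answer does not list that one separately.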


Q.43. What do you understand by pipeline stalling ? Explain with suitable
example the conditions causing pipeline stalling.

Ans. A problem which may seriously lower pipeline utilization is known as
pipeline stalling. Proper scheduling prevents pipeline stalling. Both
scalar and superscalar processors have this problem, although it is more
severe in a superscalar pipeline. Data dependences or resource conflicts
among instructions already in the pipeline or about to enter the pipeline
can cause stalling.

Example - Consider the scheduling of two instruction pipelines in a
two-issue superscalar processor. Fig. 3.39 (a) shows the case of no data
dependence on the left and flow dependence (I1 → I2) on the right. Without
data dependence, all pipeline stages are used without idling.


(a) Data Dependence Stalls the Second Pipeline in Shaded Cycles
(left: no data dependence; right: I2 uses data generated by I1)

(b) Branch Instruction I2 Causes a Delay Slot of Length 4 in Both Pipelines

(c) Resource Conflicts and Data Dependences Cause the Stalling of
Pipeline Operations for Some Cycles
(left: no resource conflicts; right: I1 and I2 conflict in using the same
functional unit and I4 uses data generated by I2)

Fig. 3.39 Dependences and Resource Conflicts May Stall One or Two
Pipelines in a Two-issue Superscalar Processor

With dependence, instruction I2 entering the second pipeline must wait
for two cycles before entering the execution stages. This delay may also
be passed on to the next instruction I4 that enters the pipeline. The
consequence of branching (instruction I2) is shown in fig. 3.39 (b). A
branch taken by I2 at cycle 5 results in a delay slot of four cycles.
Hence, both pipelines must be flushed before the target instructions I3
and I4 can enter the pipelines from cycle 6. Here, delayed branch or other
amending operations are not considered. A combined problem that includes
both resource conflicts and data dependence is shown in fig. 3.39 (c).
Instructions I1 and I2 require the same functional unit, and the flow
dependence I2 → I4 exists.

Since the two pipeline stages, e1 and e2, of the same functional unit must
be used by I1 and I2 in an overlapped manner, the net effect is that I2
must be scheduled one cycle behind. I3 is also delayed by one cycle for
the same reason. Instruction I4 is delayed for two cycles because of the
flow dependence on I2. The shaded boxes in all the timing charts
correspond to idle stages.


Q.44. Explain multipipeline scheduling in superscalar processor.


Ans. Instruction issue and completion policies are critical to
superscalar processor performance. Three scheduling policies are discussed
below. When instructions are issued in program order, it is known as
in-order issue. When program order is violated, it is known as
out-of-order issue.

Likewise, if the instructions must be completed in program order, it is
known as in-order completion. Otherwise, it is called out-of-order
completion. In-order issue may result in either in-order or out-of-order
completion. In-order issue is easy to implement but may not provide the
optimal performance.

Out-of-order issue usually finishes with out-of-order completion. The aim
of out-of-order issue and completion is to enhance performance. Fig. 3.40
shows these three scheduling policies by execution of the example program
in fig. 3.38 on the dual-pipeline hardware in fig. 3.35. It is shown that
performance can be enhanced from an in-order to an out-of-order schedule.
The performance is often represented by the total execution time and the
utilization rate of pipeline stages. Not all programs can be scheduled
out of order. Constraints are imposed by data dependence and resource
conflicts.

In-order Issue - A schedule for the six instructions being issued in
program order I1, I2, ..., I6 is shown in fig. 3.40 (a). Instructions I1,
I3 and I5 are received by pipeline 1 and instructions I2, I4 and I6 are
received by pipeline 2 in three consecutive cycles. I2 has to wait one
cycle to use the data loaded in by I1. Because it needs the same adder
used by I2, I3 is delayed one cycle. Before entering the multiplier
stages, I6 has to wait for the result of I5. I5 is forced to wait for two
cycles to come out of pipeline 1 in order to maintain in-order completion.
Overall, nine cycles are needed and five idle cycles are observed.

Out-of-order completion is permitted even if in-order issue is used, as
shown in fig. 3.40 (b). The out-of-order schedule and the in-order
schedule differ only in the way that I5 is permitted to complete ahead of
I3 and I4, which are completely independent of I5. The pipeline
utilization rate does improve, although the total execution time does not.
Only three idle cycles are observed. The lookahead window may be used for
reordering the instruction issues in order to reduce the total execution
time.

(a) In-order Issue with In-order Completion in Nine Cycles

(b) In-order Issue and Out-of-order Completion in Nine Cycles

(c) Out-of-order Issue and Out-of-order Completion in Seven Cycles using
an Instruction Lookahead Window in the Recoding Process

Fig. 3.40 Instruction Issue and Completion Policies for a Superscalar
Processor with and without Instruction Lookahead Support


Out-of-order Issue - Instruction I5 can be decoded in advance by using
the lookahead window, as it is independent of all the other instructions.
The six instructions are issued in three cycles as shown: I3 and I4 are
decoded concurrently, while I5 is fetched and decoded by the window. This
is followed by issuing I6 and I1 at cycle 2, and I2 at cycle 3. Because
the issue is out of order, the completion is also out of order, as
depicted in fig. 3.40 (c). The total execution time is now reduced to
seven cycles, with no idle stages during the execution of these six
instructions.

In-order issue and completion is the simplest policy to implement. It is
rarely used today, even in a conventional scalar processor, because of the
unnecessary delays in maintaining program order. However, this policy is
still attractive in a multiprocessor environment. Out-of-order completion
is found in both scalar and superscalar processors.


Q.45. Describe superpipelined processor and superpipelined superscalar 
processor with their architecture and performance. 


Ans. Superpipeline Design - The pipeline cycle time is 1/n of the base
cycle in a superpipelined processor of degree n. As a comparison, while a
fixed-point add takes one cycle in the base scalar processor, the same
operation takes n short cycles in a superpipelined processor implemented
with the same technology. The execution of instructions with a
superpipelined machine of degree n = 3 is shown in fig. 3.41 (a). A single
instruction is issued in each cycle in this case, but the cycle time is
one-third of the base cycle. Single-operation latency is n pipeline
cycles, which is equal to one base cycle.

Superpipelined machines have been around for a long time. Cray built both
the CDC 7600 and the Cray 1 as superpipelined machines with a latency of
n = 3 cycles for a fixed-point add. Without a high-speed clocking
mechanism, superpipelining is not possible.

Superpipeline Performance - The minimum time required to execute N
instructions for a superpipelined machine of degree n with k stages in the
pipeline is as follows -

$$T(1, n) = k + \frac{N - 1}{n} \quad \text{(base cycles)}$$

Thus, the potential speedup of a superpipelined machine over the base
machine is as follows -

$$S(1, n) = \frac{T(1, 1)}{T(1, n)} = \frac{k + N - 1}{k + (N - 1)/n} = \frac{n(k + N - 1)}{nk + N - 1}$$

The speedup $S(1, n) \to n$ as $N \to \infty$.
(a) Superpipelined Execution with Degree n = 3

(b) Superpipelined Superscalar Execution with Degree m = n = 3

Fig. 3.41 Superpipelined Processor Architectures without and with
Multiple Instruction Issues, Respectively


Superpipelined Superscalar Design - Fig. 3.41 (b) shows a superpipelined
superscalar processor of degree (m, n) with (m, n) = (3, 3). This machine
executes m instructions every cycle with a pipeline cycle 1/n of the base
cycle. Simple operation latency is n pipeline cycles. The level of
parallelism needed to fully use this machine is mn instructions. For
example, the DEC Alpha processor provides an m = 2 instruction issue rate.
Simple operations are directly supported by four functional units for
integer, floating-point, load/store and branch operations. The initial
Alpha executes with a 150 MHz clock. Compared with a scalar processor with
a clock rate of 25 MHz, this specifies a superpipelining degree of
n = 150 MHz/25 MHz = 6. Therefore, the Alpha can be approximately
characterized as having a degree of (m, n) = (2, 6).

Superpipelined Superscalar Performance - The minimum time required to
execute N independent instructions on a superpipelined superscalar machine
of degree (m, n) is -

$$T(m, n) = k + \frac{N - m}{mn} \quad \text{(base cycles)}$$

Therefore, the speedup over the base machine is as follows -

$$S(m, n) = \frac{T(1, 1)}{T(m, n)} = \frac{k + N - 1}{k + (N - m)/(mn)} = \frac{mn(k + N - 1)}{mnk + N - m}$$

The speedup limit $S(m, n) \to mn$ as $N \to \infty$, as expected.
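The general formula T(m, n) covers the base, superscalar and superpipelined machines as special cases (m = n = 1, n = 1 and m = 1 respectively). The sketch below evaluates it for a few degrees; the function names and the choice k = 8 and N = 10^6 are assumptions made only for illustration.

```python
import math


def exec_time(n_instr, k, m=1, n=1):
    """T(m, n) = k + (N - m)/(m n) base cycles for a machine of degree (m, n)."""
    return k + (n_instr - m) / (m * n)


def speedup(n_instr, k, m=1, n=1):
    """S(m, n) = T(1, 1) / T(m, n), the speedup over the scalar base machine."""
    return exec_time(n_instr, k) / exec_time(n_instr, k, m, n)


def speedup_closed_form(n_instr, k, m, n):
    """The simplified form m n (k + N - 1) / (m n k + N - m)."""
    return m * n * (k + n_instr - 1) / (m * n * k + n_instr - m)


if __name__ == "__main__":
    k, n_instr = 8, 10**6  # illustrative values only
    for m, n in [(1, 1), (3, 1), (1, 3), (3, 3), (2, 6)]:
        s = speedup(n_instr, k, m, n)
        assert math.isclose(s, speedup_closed_form(n_instr, k, m, n))
        print(f"(m, n) = ({m}, {n}): S = {s:.4f}, limit mn = {m * n}")
```

With N this large each speedup is already very close to its limit mn; rerunning with a small N shows how the k-cycle fill time holds the speedup down.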


Q.46. Explain the design of superpipeline processor with diagram, 


(R.GP.V., Dec. 2017) 
Ans. Refer to Q.45. 


Q.47. Differentiate between superscalar pipeline and superpipeline
design.


(R.GP.V, Dec. 201 
Or +) 


Compare the performance of the superscalar processor and
superpipelined processor. (R.G.P.V., Dec. 2016)


Ans. Refer to Q.40 and Q.45. 


Q.48. Discuss supersymmetry and design tradeoffs in superscalar and 
superpipelined processors. 


Ans. Fig. 3.42 shows the comparison between superscalar processor 
and a superpipelined processor, both of degree 3, issuing a basic block of six 
independent instructions. The issue of three instructions per base cycle is the 
rate for the superscalar machine, whereas the superpipelined machine takes 
only one-third of the base cycle to issue each instruction. Both machines will 
execute the same number of instructions during the same interval in the steady 
state. 


(a) Startup Delays in Superscalar versus Superpipelined Processors

(b) Relative Performance with Respect to Base Scalar Processor
(horizontal axis: degree of superscalar or superpipeline)

Fig. 3.42


Supersymmetry - There is a longer startup delay in the superpipelined
machine. The superpipelined machine lags behind the superscalar machine at
the beginning of the program, as shown in fig. 3.42 (a). In addition, a
branch may cause more damage on the superpipelined machine than on the
superscalar machine. This effect decreases as the degree of
superpipelining increases and all issuable instructions are issued closer
together.

As shown in fig. 3.42 (b), there is a duality of latency and parallel
issue of instructions.

These two curves were obtained by Jouppi and Wall in 1989 after simulation
runs of eight benchmarks on an ideal base machine, on a superscalar
machine of degree m and on a superpipelined machine of degree n, where
2 ≤ m, n ≤ 8.

These simulation results indicate that the superpipelined machine has a
lower performance compared to the superscalar machine, and that the
performance gap decreases with increasing degree. This is consistent with
the extra startup overhead and higher branching damage noted above.

The smaller latency in issuing instructions is traded for more cycles,
but both types of machines have the same simple operation latency.
Therefore, they should perform equally in the steady state.


Design Tradeoffs - Fig. 3.43 plots the speedup curves for k = 8 pipeline
stages. In this figure, four cases are plotted with respect to
single-issue, dual-issue, triple-issue and quad-issue machines (m = 1, 2,
3 and 4 respectively). All speedup curves are functions of N, the number
of instructions being executed. The speedup curves show steady increases
when the degree of superpipelining increases. Pipeline performance
improves in the same way when the instruction issue rate increases in
these plots. Design tradeoffs are present in regard to the choices of
(m, n).

Two important limitations are as follows -

(i) The superscalar degree m is restricted by the small ILP encountered in
programs.

(ii) The superpipeline clock cycle is restricted by the multiphase
clocking technology available for distributing clock phases and by the
long setup time of registers.

Fig. 3.43 Ideal Speedup of Superpipelined and/or Superscalar Machines of
Degrees (m, n), Compared with the Scalar Base Machine with a Unity
Speedup S(1, 1) = 1


Q.1. Briefly discuss the cache coherence.
(R.G.P.V., June 2006, Dec. 2009)
Or
What is cache coherence ? (R.G.P.V., Dec. 2015)
Or
Write short note on cache coherence. (R.G.P.V., Dec. 2017)


Ans. In a multiprocessor system, the presence of private caches introduces
the problem of cache coherence, which may lead to data inconsistency. It
means that various copies of the same data may occur in different caches
at any given instant. This problem exists even in a uniprocessor with
cache, when the processor can be active after modifying a word in the
cache and before the copy in memory has been updated. The cache coherence
problem is only a theoretical observation without practical bearing if
the processor is the only unit to access memory. However, practical
systems contain I/O units which require access to memory.

This is a big problem in asynchronous parallel algorithms that do not
possess explicit synchronization stages of the computation. For instance,
process pA, which runs on processor i, generates data x that is to be
consumed asynchronously by process pB, which runs on processor j. Process
pA writes a new value of x into its cache while process pB employs the old
value of x in its cache because it is unaware of the new x. Process pB may
continue to use the old value of x in its cache unless it is informed of
the presence of the new x in process pA's cache, so that a copy of it may
be made in its own cache. The possibility of having several processors
employing different copies of the same data must be avoided if the system
is to perform correctly.

A system of caches is coherent if and only if a READ carried out by any
processor i of a main memory location x always provides the most recent
value with the same address x. Thus, whenever a WRITE is performed by one
processor i to a memory location x, completion of the WRITE must ensure
that all subsequent READs of location x by any processor will provide the
new contents of x until another WRITE to x is completed. The cache
coherence problem exists only when caches are linked with the processors.
There are certain designs in which caches are linked with the shared
memory, as depicted in fig. 4.1. This avoids the cache coherence problem.

Fig. 4.1 Caches Associated with Shared Memory Lines to Avoid Data
Inconsistency

This architecture is better for those systems that have a small number of
processors. However, the potential gain in speed is then limited by the
transmission delays through the interconnection network and by conflicts
at the caches. This technique has been shown to be satisfactory for
multiprocessors in which each processor is pipelined and executes multiple
independent instruction streams.


Q.2. Discuss the various ways to cause cache inconsistencies.

Ans. Cache inconsistencies can be caused by data sharing, process
migration and I/O.

(i) Inconsistency in Data Sharing - The problem of cache inconsistency
takes place only if multiple private caches are used. Generally, there
are three sources of the problem, viz. sharing of writable data, process
migration, and I/O activity. The difficulties caused by the first two
sources are shown in fig. 4.2. Suppose there is a multiprocessor system
with two processors, each having a private cache and both sharing the main
memory. Let D be a shared data element that has been referenced by both
processors. The three copies of data element D are consistent before
update.

When processor 1 writes new data D' into its cache, the same copy will be
written immediately into the shared memory under a write-through policy.
In this situation, as shown in fig. 4.2 (a), an inconsistency takes place
between the two copies of the data element (D' and D) in the two caches.

An inconsistency in data sharing, as shown in fig. 4.2 (a), may also take
place when a write-back policy is used. In that case the main memory is
updated only eventually, when the modified data in the cache are replaced
or invalidated.


(ii) Process Migration - The occurrence of inconsistency after a process
having a shared data D migrates from processor 1 to processor 2 employing
the write-back cache is depicted on the right of fig. 4.2 (b). In the
middle, a process migrates from processor 2 to processor 1 when using
write-through caches.

In both cases, inconsistency takes place between the two cache copies
(D and D'). These inconsistencies must be avoided by taking special
precautions. A coherence protocol must be set up before processes can
safely migrate from one processor to another.


(a) Inconsistency in Sharing of Writable Data
(panels: Before Update, Write-through, Write-back)

(b) Inconsistency after Process Migration
(panels: Before Migration, Write-through, Write-back)

Fig. 4.2 Cache Coherence


(iii) Input/Output - Problems of inconsistency may take place during I/O
operations that bypass the caches. When a new data D' is loaded by the I/O
processor into the main memory, bypassing the write-through caches [see
fig. 4.3 (a)], inconsistency takes place between cache 1 and the shared
memory. The write-back caches also create inconsistency when outputting a
data directly from the shared memory, that is, bypassing the caches.

One possible solution to the I/O inconsistency problem is to attach the
I/O processors (IOP1 and IOP2) to the private caches (C1 and C2)
respectively, as shown in fig. 4.3 (b). In this way, the I/O processors
share the caches with the CPU. The I/O consistency can be handled if
cache-to-cache consistency is maintained through the bus. The drawback of
this scheme is the likely increase in cache perturbations and the poor
locality of I/O data, which may produce higher miss ratios.


(a) I/O Operations Bypassing the Cache
(panels: Memory (Input) with Write-through Caches, Memory (Output) with
Write-back Caches)

(b) A Possible Solution

Fig. 4.3 Cache Inconsistency
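The sharing case of fig. 4.2 (a) can be imitated with a toy model of two private caches over one shared memory. This is a minimal sketch, not a model of any real cache: the class `PrivateCache`, the function `share_writable_data` and the use of the strings "D" and "D'" as data values are assumptions for illustration, and no coherence protocol is modelled on purpose.

```python
class PrivateCache:
    """A private cache with no coherence protocol (illustrative only)."""

    def __init__(self, memory, write_through):
        self.memory = memory            # shared main memory (a dict)
        self.write_through = write_through
        self.lines = {}                 # address -> cached value

    def read(self, addr):
        if addr not in self.lines:      # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value   # memory is updated immediately
        # With write-back, memory is updated only on replacement (not modelled).


def share_writable_data(write_through):
    """Both processors cache D, then processor 1 writes D'.
    Returns (value seen by cache 1, value seen by cache 2, value in memory)."""
    memory = {"D": "D"}
    cache1 = PrivateCache(memory, write_through)
    cache2 = PrivateCache(memory, write_through)
    cache1.read("D")
    cache2.read("D")                    # three consistent copies before update
    cache1.write("D", "D'")
    return cache1.read("D"), cache2.read("D"), memory["D"]


if __name__ == "__main__":
    print("write-through:", share_writable_data(write_through=True))
    print("write-back:   ", share_writable_data(write_through=False))
```

Under write-through the memory follows cache 1 but cache 2 keeps the stale D; under write-back both cache 2 and the memory hold the stale value until the modified line is replaced.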


Q.3. Explain multilevel cache coherence.


Ans. Fig. 4.4 shows the consistency of caches at various levels. Wilson
proposed an extension to the write-invalidate protocol used on a single
bus to maintain consistency among caches at different levels.


[Figure: shared memory, second-level caches, cluster buses, first-level
caches and processors]

Fig. 4.4


To invalidate all copies in the shared caches at level 2, an invalidation
must propagate vertically up and down. Suppose processor 0 issues a write
request. The write request propagates up to the highest level and
invalidates copies in C9, Cy, C13 and Cy5, as shown by the arrows from Cy,
to all the shaded copies.

High-level caches like C20 keep track of dirty blocks beneath them. A
subsequent read request issued by P4 will propagate up the hierarchy since
no copies exist. When the read request reaches the top level, cache Cy
issues a flush request down to cache Cyq. In this case, the dirty copy is
supplied to the private cache associated with processor P4. The
higher-level caches work as filters for consistency control. A read
request or an invalidation command is not propagated down to clusters that
do not have a copy of the corresponding block. The cache C,, works in this
way.



‘ate between write-through and write-back cache. 


0.4. Differentia 
—— (R.GP.V., Dec. 2006) 


Or 
-through versus write-back cache associated with 


Explain the term write 
(R.GPV., June 2011, Dec. 2014) 


cache design. 
Ans. An important aspect of cache organization is concerned with memory 


write requests. When the CPU finds a word in cache during a read operation, 
the main memory is not involved in the transfer. However, if the operation is a 
write, there are two ways that the system can proceed. 

The simplest and most commonly used procedure is to update main 
memory with every memory write operation, with cache memory being updated 
in parallel if it contains the word at the specified address. This is called the 
write-through method. This method has the advantage that main memory 
always contains the same data as the cache. This characteristic is important in 
systems with direct memory access transfers. It ensures that the data residing 
in main memory are valid at all times so that an I/O device communicating 
through DMA would receive the most recent updated data. 

The second procedure is called the write-back method. In this method only the cache location is updated during a write operation. The location is then marked by a flag so that later, when the word is removed from the cache, it is copied into main memory. The reason for the write-back method is that during the time a word resides in the cache, it may be updated several times; however, as long as the word remains in the cache, it does not matter whether the copy in main memory is out of date, since requests for the word are filled from the cache. It is only when the word is displaced from the cache that an accurate copy need be rewritten into main memory.
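The difference between the two methods can be sketched by counting main-memory writes. This is only an illustrative model: the function name, the event names ("write", "evict") and the trace are assumptions made for the example, not part of any real cache.

```python
def count_memory_writes(trace, policy):
    """Count main-memory writes for a trace of (event, address) pairs.

    event is "write" (CPU write that hits in the cache) or "evict"
    (the word is displaced from the cache). Hypothetical model only.
    """
    dirty = set()          # addresses flagged as modified (write-back only)
    memory_writes = 0
    for event, addr in trace:
        if event == "write":
            if policy == "write-through":
                memory_writes += 1      # memory updated on every write
            else:
                dirty.add(addr)         # only the cache copy and its flag change
        elif event == "evict":
            if policy == "write-back" and addr in dirty:
                dirty.remove(addr)
                memory_writes += 1      # copied to memory once, on displacement
    return memory_writes


# A word written five times while it stays in the cache, then displaced.
trace = [("write", 0x10)] * 5 + [("evict", 0x10)]
print(count_memory_writes(trace, "write-through"))  # 5
print(count_memory_writes(trace, "write-back"))     # 1
```

In this model the write-through count grows with every write, while the write-back count depends only on how many flagged words are displaced.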


Q.5. List the various types of cache event and actions. 


Ans. The following events and actions are triggered by the memory-access and invalidation commands -

(i) Read Miss - A read miss takes place if a processor wants to read a block that is not present in its cache. A bus-read operation will be initiated. The main memory contains a consistent copy and supplies a copy to the requesting cache, if no dirty copy exists. The cache holding the dirty copy will inhibit the main memory and send a copy to the requesting cache, if a dirty copy exists. The cache copy will enter the valid state after a read miss in all cases.

(ii) Read Hit - Read hits are always carried out in a local cache without using the snoopy bus for invalidation and without causing a state transition.

(iii) Write Hit - When the copy is in the dirty or reserved state, the write can be carried out locally and the new state is dirty. When the copy is in the valid state, a write-invalidate command is broadcast to all caches, invalidating their copies. After this first write, the shared memory is written through and the resulting state is reserved.

(iv) Write Miss - If the processor fails to write in a local cache, the copy must come either from the main memory or from a remote cache with a dirty block. This is accomplished by sending a read-invalidate command which will invalidate all cache copies. Hence, the local copy is updated and ends up in a dirty state.

(v) Block Replacement - If a copy is dirty, it has to be written back to main memory on block replacement. No write-back occurs when the copy is clean.

Q.6. What are the various cache design alternatives ? 

Ans. The three design alternatives are as follows — 

(i) Shared Caches—An alternative method to maintain cache coherence 
is to completely remove the problem by employing shared caches connected to 
shared-memory modules. In this case, no private caches are permitted. This method 
will decrease the main memory access time but contributes very little to resolving 
access conflicts and to decreasing the overall memory-access time. 

Shared caches can be built as second-level caches. Sometimes, second-level caches are partially shared by various clusters of processors. When both private and shared caches are used in a memory hierarchy, various cache architectures are possible. The use of shared caches alone may work against the scalability of the whole system.

(ii) Noncacheable Data - Another method is not to cache shared writable data. Only instructions or private data are cacheable in local caches and shared data are noncacheable. Shared data include process queues, locks and any other data structures protected by critical sections. The data must be tagged by the compiler as either cacheable or noncacheable. Special hardware tagging must be used to distinguish them. Caches with cacheable and noncacheable blocks need more programmer effort, besides support from compilers.

(iii) Cache Flushing - A third alternative is to use cache flushing when a synchronization [...] multiprocessor [...] of cache flushes. I/O and process migration problems cannot be solved by this method. To increase efficiency, flushing can be made very selective by programmers or by the compiler. Cache flushing at I/O, synchronization and process migration may be carried out selectively or unconditionally. It is often used with virtual address caches.
Q.7. Explain the term factors affecting cache hit ratios associated with cache design. (R.G.P.V., June 2011, Dec. 2014)


Ans. The cache hit ratio is affected by the cache size and by the block size as shown in fig. 4.5. Generally, the hit ratio increases with respect to increasing cache size. A 100% hit ratio should be expected when the cache size reaches infinity. However, this is never possible because a real cache has a finite size.

With a fixed size cache, cache performance is rather sensitive to block size. The hit ratio improves as the block size increases. The increase reaches its peak at a certain optimum block size. After this point, the hit ratio decreases as the block size increases; this is due to a mismatch between block size and program behaviour. As a matter of fact, as the block size becomes very large, many words fetched into the cache may never be used. Also the temporal locality effects are gradually lost with larger block size. At last, the hit ratio approaches zero when the block size equals the entire cache size.

(a) Hit Ratio versus Cache Size (bytes) (b) Hit Ratio versus Block Size (bytes), with fixed cache size

Fig. 4.5
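The effect of block size on a fixed-size cache can be tried out with a small simulation. The sketch below assumes a direct-mapped cache addressed in words and a synthetic trace of two interleaved sequential streams; the function name, the trace and the sizes are illustrative assumptions, and a real program would show its peak at a workload-dependent block size.

```python
def hit_ratio(trace, cache_words, block_words):
    """Hit ratio of a direct-mapped cache (sizes in words) on an address trace."""
    num_lines = cache_words // block_words
    lines = [None] * num_lines            # block number held by each cache line
    hits = 0
    for addr in trace:
        block = addr // block_words
        index = block % num_lines
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block          # miss: fetch the whole block
    return hits / len(trace)


# Two sequential streams accessed alternately (an assumed access pattern).
two_streams = [addr for i in range(512) for addr in (i, 4128 + i)]

for block_words in (1, 2, 4, 8, 16, 32, 64):
    print(block_words, hit_ratio(two_streams, 64, block_words))
```

On this trace the hit ratio rises with the block size while both streams still fit in separate lines, and drops to zero when the block size equals the 64-word cache, because the single line is then replaced on every access.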


Q.8. Discuss the cache performance issues. (R.G.P.V., Dec. 2016)
Ans. The performance of cache is based on cycle count and hit ratio of cache. Cycle count is the number of cycles needed for cache operations (i.e. access, coherence control, update etc.). Hit ratio indicates how effectively data is found in cache. There are various factors which affect the cache performance. Some of them are given below -
(i) The cycle count is affected by cache size, associativity, write-through or write-back policies, set number, block size, cache organization, and dynamic or static RAM technology.
(ii) Hit ratio can be affected by variable cache or block size. Expected hit ratio is 100% when cache size tends to infinity.
(iii) Block size of cache affects the cache performance. Increasing block size increases the hit ratio up to a certain peak value. After this, hit ratio will decrease with increment of block size.
(iv) In a set associative cache of fixed capacity, increasing the number of sets results in decreasing hit ratio.


Q.9. Explain MESI protocol for cache coherence with suitable example.
(R.G.P.V., June 2011)
Ans. A cache controller must keep careful track of the state of each cache block (line) under its control to maintain consistency in a multiprocessor or in a uniprocessor with independent I/O processors. It does so by attaching a few state bits to every block stored in the cache data memory and processing the states according to some coherence algorithm or protocol, as it is often called. Some microprocessors, including some PowerPC models, employ a standard cache coherence protocol based on the following four states -

(i) Modified - The line in the cache has been modified (different from main memory) and is available only in this cache.

(ii) Exclusive - The line in the cache is the same as that in main memory and is not present in any other cache.

(iii) Shared - The line in the cache is the same as that in main memory and may be present in another cache.

(iv) Invalid - The line in the cache does not contain valid data.

Fig. 4.6 (state-transition diagram with transitions labelled Reset, Read Miss (Shared Data), Snoop Read Hit, Write Hit, and Read or Write Hit)

The control algorithm using these states is known as the MESI protocol. Fig. 4.6 shows a simplified version of the MESI protocol, indicating how the states of a cache block change in response to various read and write conditions, assuming that a write-back policy and a cache-snooping mechanism are used. We also assume a one-level cache, although the protocol works equally well with multiple cache levels.
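As an example, the four states can be traced for a single block shared by a few caches. The sketch below is a simplified model written for illustration: the class name `MesiBus` and its methods are assumptions, a single block is tracked, and write-back traffic is only noted in comments.

```python
M, E, S, I = "M", "E", "S", "I"


class MesiBus:
    """State of one memory block in each of n snooping caches (simplified)."""

    def __init__(self, n_caches):
        self.state = [I] * n_caches

    def read(self, p):
        """Processor p reads the block."""
        if self.state[p] != I:
            return                              # read hit: no state change
        holders = [q for q in range(len(self.state))
                   if q != p and self.state[q] != I]
        for q in holders:
            self.state[q] = S                   # snoop read hit: M (after write-back), E, S -> S
        self.state[p] = S if holders else E     # alone -> Exclusive, else Shared

    def write(self, p):
        """Processor p writes the block."""
        for q in range(len(self.state)):
            if q != p:
                self.state[q] = I               # other copies are invalidated
        self.state[p] = M                       # local copy now differs from memory


bus = MesiBus(2)
bus.read(0);  print(bus.state)   # ['E', 'I']
bus.read(1);  print(bus.state)   # ['S', 'S']
bus.write(1); print(bus.state)   # ['I', 'M']
bus.read(0);  print(bus.state)   # ['S', 'S']
```

The last step shows the modified copy being shared again: the owner supplies the current data and both caches end in the Shared state.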


Q.10. Explain the following memory update policies — 

(i) Write-through caches 

(ii) Write-back caches 

(iii) Write-once protocol. 
Ans. (i) Write-through Caches - In a cache, the states of a cache block copy change with respect to read, write and replacement operations. The state transitions for two basic write-invalidate snoopy protocols developed for write-through and write-back caches are shown in fig. 4.7. A block copy of a write-through cache m attached to processor m can assume one of two possible cache states - valid or invalid, as shown in fig. 4.7 (a).

The events labelling the transitions in fig. 4.7 are -

W(m) = Write to block in cache m by processor m.
R(m) = Read block in cache m by processor m.
Z(m) = Replace block in cache m.
W(n) = Write to block copy in cache n by processor n ≠ m.
R(n) = Read block copy in cache n by processor n ≠ m.
Z(n) = Replace block copy in cache n ≠ m.

(a) Write-through Cache (b) Write-back Cache
Fig. 4.7 State-transition Graphs for a Cache Block using Write-invalidate Snoopy Protocols

A remote processor is represented by n, where n ≠ m. There are six possible events which may take place for each of the two cache states. It is noted that the same transition graph is used by all cache copies of the same block in making state changes.

As shown in fig. 4.7 (a), all processors can read (R(m), R(n)) safely in the valid state. Local processor m can also write (W(m)) safely in a valid state. The invalid state corresponds to the case of the block either being invalidated or being replaced (Z(n) or Z(m)). All other cache copies are invalidated when a remote processor writes (W(n)) into its cache copy. If a successful read (R(m)) or write (W(m)) is performed by a local processor m, then the cache block in cache m is valid.

Because of the requirement of request invalidations, the fraction of read cycles is lower on the bus as compared to the fraction of write cycles in a write-through cache. The cache directory can be made in dual copies or dual-ported to filter out most invalidations.


(ii) Write-back Caches - As shown in fig. 4.7 (b), the valid state of a write-back cache is divided into two cache states - read-write (RW) and read-only (RO). The invalidated (INV) or not-in-cache state is equivalent to the invalid state of fig. 4.7 (a). This three-state coherence scheme corresponds to an ownership protocol.

Caches can have only the RO copies of the block when the memory owns a block. That is, multiple copies may be present in the RO state and every processor containing a copy, known as a keeper of the copy, can read (R(m), R(n)) the copy safely. Whenever the local processor replaces (Z(m)) its own block copy or a remote processor writes (W(n)) its local copy, the INV state is entered. The RW state corresponds to only one cache copy existing in the whole system, owned by the local processor m. Read (R(m)) and write (W(m)) can be safely carried out in the RW state. The cache block becomes uniquely owned when a local write (W(m)) occurs from either the INV state or the RO state.

As shown in fig. 4.7 (b), other state transitions can be similarly figured out. Before a block is modified, ownership for exclusive access must first be obtained by a read-only bus transaction that is broadcast to all caches and memory. In case a modified block copy is already available in a remote cache, the memory must first be updated and the copy is invalidated. The ownership is then transferred to the requesting cache.
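The three-state scheme can be written down as a transition table for the block copy in cache m. The sketch below follows the transitions stated above; the entries for a remote read in the RW state (RW to RO) and for a local read in the INV state (INV to RO) are assumptions taken from the usual form of this graph, since the text leaves them to the figure.

```python
RW, RO, INV = "RW", "RO", "INV"

# Next state of the block copy in cache m for each event of fig. 4.7.
# m is the local processor, n is any remote processor.
TRANSITIONS = {
    RW:  {"R(m)": RW, "W(m)": RW, "R(n)": RO,  "W(n)": INV, "Z(m)": INV, "Z(n)": RW},
    RO:  {"R(m)": RO, "W(m)": RW, "R(n)": RO,  "W(n)": INV, "Z(m)": INV, "Z(n)": RO},
    INV: {"R(m)": RO, "W(m)": RW, "R(n)": INV, "W(n)": INV, "Z(m)": INV, "Z(n)": INV},
}


def run(events, state=INV):
    """Apply a sequence of events to a copy that starts in the given state."""
    for event in events:
        state = TRANSITIONS[state][event]
    return state


print(run(["R(m)"]))                  # RO: a read-only copy is loaded
print(run(["R(m)", "W(m)"]))          # RW: a local write takes ownership
print(run(["W(m)", "W(n)"]))          # INV: a remote write invalidates the copy
```

Because every cache applies the same table to its own copy, running the table once per cache with the roles of m and n exchanged gives the state of all copies of the block.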

(iii) Write-once Protocol - A cache coherence protocol for bus-based multiprocessors was presented by James Goodman in 1983. This scheme has the benefits of both write-through and write-back invalidations. The very first write of a cache block employs a write-through policy to decrease the bus traffic. This will result in a consistent memory copy while all other cache copies are invalidated. Shared memory is updated using a write-back policy after the first write. A four-state transition graph for this scheme is illustrated in fig. 4.8. The four cache states are given as follows -

(a) Valid - The cache block, consistent with the copy of memory, has been read from shared memory. This cache block has not been modified (written).

(b) Invalid - The cache block is inconsistent with the memory copy or is not present in the cache.

(c) Reserved - Since being read from shared memory, data has been written exactly once. This cache copy is consistent with the memory copy, which is the only other copy.

(d) Dirty - The cache block has been written more than once and the cache copy is the only one in the system.

Fig. 4.8 Write-once Cache Coherence Protocol using the Write-invalidate Policy on Write-back Caches

The cache needs two different sets of commands for maintaining consistency, as shown in fig. 4.8. The solid lines are analogous to access commands issued by a local processor. The valid state is entered when a read miss takes place. The first write hit results in the reserved state. The second write hit results in the dirty state and all future write hits remain in the dirty state. The cache block enters the dirty state when a write miss takes place.

The dashed lines in fig. 4.8 are analogous to invalidation commands issued by remote processors through the snoopy bus. The bus-read command is analogous to a normal memory read by a remote processor through the bus. The write-invalidate command invalidates all other copies of a block. The read-invalidate command reads a block and invalidates all other copies.
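The local and remote transitions just described can be sketched as two small functions. The function and event names are assumptions chosen for this example, and the effect of a remote bus-read on a reserved or dirty copy (falling back to valid) is an assumption, since the text does not state it.

```python
VALID, INVALID, RESERVED, DIRTY = "valid", "invalid", "reserved", "dirty"


def local_access(state, event):
    """Next state after an access command from the local processor (solid lines)."""
    if event == "read_miss":
        return VALID
    if event == "write_miss":
        return DIRTY
    if state == INVALID:
        raise ValueError("a hit is not possible on an invalid copy")
    if event == "read_hit":
        return state                      # read hits cause no transition
    if event == "write_hit":
        # first write: written through, copy becomes reserved;
        # later writes stay local and the copy is dirty
        return RESERVED if state == VALID else DIRTY
    raise ValueError(event)


def remote_command(state, command):
    """Next state after a command snooped from a remote processor (dashed lines)."""
    if command in ("write_inv", "read_inv"):
        return INVALID
    if command == "bus_read":
        # assumed: a reserved or dirty copy becomes valid once it is shared again
        return VALID if state in (RESERVED, DIRTY) else state
    raise ValueError(command)


state = INVALID
for event in ("read_miss", "write_hit", "write_hit", "write_hit"):
    state = local_access(state, event)
    print(event, "->", state)
```

The loop prints valid, reserved, dirty, dirty, matching the sequence of a read miss followed by three write hits.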


Q.11. What is cache coherence protocol ? Explain Goodman’s write-once cache coherence protocol. (R.G.P.V., May 2018)


Ans. Cache Coherence Protocol — Refer to Q.9. 
Goodman’s Write-once Cache Coherence Protocol — Refer to Q.10 (iii). 


Q.12. How can we solve cache coherence problem ?
Ans. To solve the cache coherence problem there are two different methods -

(i) Static Coherence Check - Static coherence check avoids multiple copies by implementing different paths for private (cacheable) and shared (noncacheable) data. The shared data structures that are modifiable are noncacheable and exist only in main memory. A reference to this shared data is made directly to main memory. The cacheability of data minimizes the number of conflicts in main memory.

Let p_s be the probability of referencing a shared modifiable datum, t_c be the cache cycle time and t_m be the time to reference a datum in main memory. Then, the lower bound on a datum reference is given as -

$(1 - p_s)\,t_c + p_s\,t_m = t_c\left[1 + p_s\left(\frac{t_m}{t_c} - 1\right)\right]$ ...(i)

The performance of this method may be quite poor for algorithms with intense sharing, regardless of cache size, when the cycle ratio t_m/t_c is large. The performance of this scheme is enhanced by connecting a cache module with each memory line. This cache module is employed to buffer the noncacheable data, thus decreasing t_m and also the cycle ratio t_m/t_c.

In a similar scheme, the shared data is accessed using a shared cache while private data references and instruction fetches are carried out in private caches. This shared cache concept is illustrated in fig. 4.9. The shared cache includes interleaved cache modules that may be attached to the shared memory and processors with the help of an interconnection network. All data references continue at cache speed except when a miss occurs in either the private cache or the shared cache, or when conflicts take place at the shared cache. This network is less complex as compared to a full crossbar.


Fig. 4.9 Multiprocessor System with Private and Shared Data Paths

This method eliminates the contention problem at the main memory when the hit ratio is large enough in all caches. The success of the shared cache concept depends on the relatively small rate of shared data references, since shared data may exhibit less locality compared to private data. In the shared cache, the hit ratio improves with the increase in the degree of sharing of the shared variables. Indeed, in the shared cache, a processor may find a shared variable even if it never referenced it before. Besides, an effective way to improve the hit ratio is to increase the size of the shared cache.

The shared cache concept requires that data be tagged as either shared or private. Basically, the tagging is static. Static tags are formed during compile time and remain the same throughout the lifetime of the process, while dynamic tags are formed during the execution of cooperating processes. A mechanism monitors the history of sharing of the data space in each phase and predicts the probability of sharing of the subspaces in the next phase. In this scheme, a data subspace could be in the private caches for efficient access in one phase and in the shared caches in another phase for efficient sharing, or vice-versa. The overhead may not be acceptable because the caches must be flushed to main memory. Furthermore, the migration of data subspaces may generate constraints on the loader or scheduler. The tagging of data requires that the compiler be designed to detect shared and private data sets. This may be accomplished by explicit indication of such data sets, with the advent of abstract and block-structured languages like Concurrent Pascal. It is clear that the shared cache concept lacks flexibility.
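The lower bound of eq. (i) for the static coherence check is easy to evaluate numerically. The sketch below is only a worked illustration: the function names and the sample values of p_s, t_c and t_m are assumptions, not measurements.

```python
def datum_reference_time(p_s, t_c, t_m):
    """Lower bound on a datum reference: (1 - p_s) * t_c + p_s * t_m."""
    return (1 - p_s) * t_c + p_s * t_m


def datum_reference_time_ratio_form(p_s, t_c, t_m):
    """Same bound written with the cycle ratio t_m / t_c."""
    return t_c * (1 + p_s * (t_m / t_c - 1))


# Assumed values: cache cycle of 2 time units, memory reference of 20 units.
for p_s in (0.0, 0.25, 0.5):
    print(p_s, datum_reference_time(p_s, t_c=2, t_m=20))
```

With a cycle ratio of 10, a sharing probability of 0.25 already raises the bound from 2 to 6.5 time units, which shows why intense sharing hurts this method when t_m/t_c is large.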


(ii) Dynamic Coherence Check - The dynamic coherence check method is more flexible as compared to the static coherence check, however it is more complex and possibly more expensive. In this method multiple copies are allowed. Whenever a processor updates a location l in a cache block, it must check the other caches to invalidate possible copies. This operation is called cross-interrogate (XI).

In one implementation of this method, the caches are tied on a high-speed bus. When a local processor writes into a shared block in its cache, it sends a signal to all the remote caches to specify that "the data at this memory location has been updated". At the same time, it writes through memory. A processor requesting an XI must wait for an acknowledge signal from all other processors, to ensure correctness of execution, before it can complete the write operation. The XI invalidates the remote cache location corresponding to l, if it exists in that cache. The remote processor incurs a cache miss when it refers to this invalid cache location, and the miss is serviced to retrieve the block containing the updated data. If n represents the number of processors, then (n - 1) XIs result for each write operation. The traffic on the high-speed bus becomes a bottleneck when the number of processors increases.




There is a potential for races if the XI requests are queued to accommodate the peak traffic on the bus. The Univac 1100/80 and Honeywell 60/66 multiprocessors contain a cache-invalidate interface between every pair of caches.

A better method filters the XI requests before they begin. The memory control element keeps a central copy of the directories of all the caches. The presence flag technique is a similar scheme that considers a write-back update policy. Fig. 4.10 shows that there are two central tables associated with the blocks of main memory (MM).

Fig. 4.10 Organization of Flags for Dynamic Solution to Cache Coherence
The first table is a two-dimensional table known as the present table, in which each entry P[i, c] contains a present flag for the ith block of MM in the cth cache. The cth cache has a copy of the ith block of MM when P[i, c] = 1; otherwise P[i, c] is zero. The second table is a one-dimensional table known as the modified table, in which each entry M[i] contains a modified flag for the ith block of MM. When M[i] = 1, there is a cache with a copy of the ith block more recent than the corresponding copy in MM. The present and modified tables can be implemented in a fast random-access memory.

In cache coherence check an arbitrary number of caches can have a copy of a block, provided that all the copies are the same, that is, the processor associated with each of the caches has not tried to modify its copy since the copy was loaded in its cache. This copy is a read only (RO) copy. A processor must own the block copy with exclusive read-write (RW) or exclusive read only (EX) access rights to modify a block copy in its cache.

A copy is held EX in a cache when the cache is the only one with the block copy and the copy has not been modified. In the same manner, a copy is held RW in a cache if the cache is the only one with the block copy and the copy has been modified. Thus, for the sake of consistency, only one processor can own an EX or RW copy of a block at any time.
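The role of the present and modified tables in filtering cross-interrogates can be sketched as follows. The class and method names are hypothetical, and the handling of a read to a modified block (the owner's copy is written back so that M[i] returns to 0) is an assumption made to keep the example small.

```python
class FlagTables:
    """Present table P[i][c] and modified table M[i] kept with main memory."""

    def __init__(self, n_blocks, n_caches):
        self.P = [[0] * n_caches for _ in range(n_blocks)]
        self.M = [0] * n_blocks

    def read(self, c, i):
        """Cache c loads block i."""
        if self.M[i]:
            self.M[i] = 0            # assumed: modified copy is written back first
        self.P[i][c] = 1

    def write(self, c, i):
        """Cache c modifies block i; returns the caches that receive an XI."""
        targets = [q for q, flag in enumerate(self.P[i]) if flag and q != c]
        for q in targets:
            self.P[i][q] = 0         # remote copy invalidated
        self.P[i][c] = 1
        self.M[i] = 1                # cache c now holds the most recent copy
        return targets


tables = FlagTables(n_blocks=2, n_caches=4)
tables.read(0, 0)
tables.read(1, 0)
print(tables.write(1, 0))   # [0]: one XI instead of n - 1 = 3
print(tables.P[0], tables.M[0])
```

Only the caches whose present flag is set are interrogated, which is the filtering effect that the central tables are meant to provide.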


Q.13. Explain cache coherence problem and its solutions briefly.
(R.G.P.V., June 2013)


Ans. Refer to Q.1 and Q.12. 


Q.14. Describe snoopy bus protocols to cache coherence problem.
Or
Explain snoopy coherence protocol. (R.G.P.V., June 2011)
Or
What is snoopy protocols ? Where is it used ? (R.G.P.V., June 2012)
Or
What is the use of snoopy protocol ? Explain. (R.G.P.V., June 2014)
Or
Explain the concept of bus snooping by cache coherence to maintain coherence. (R.G.P.V., Dec. 2015)
Or
What are snoopy protocols ? When is it used ? (R.G.P.V., June 2016)
Or
Write short note on snoopy bus protocol. (R.G.P.V., May 2018)

Ans. Write-invalidate and write-update are the two approaches used for maintaining cache consistency when private caches are associated with processors tied to a common bus. If a local cache is updated, then the write-invalidate policy will invalidate all remote copies. Under the write-update policy, the new data block will be broadcast to all caches containing a copy of the block. Snoopy protocols obtain data consistency among the caches and shared memory using a bus monitoring mechanism.

Fig. 4.11 shows that the two snoopy bus protocols produce different results. Consider processors 1, 2, 3 and n maintaining consistent copies of a block in their local caches and in the shared-memory module, marked B as shown in fig. 4.11 (a). Processor 1 modifies (writes) its cache from B to B', and all other copies are invalidated using the bus as shown in fig. 4.11 (b), by the write-invalidate protocol. Invalidated blocks are sometimes known as dirty, i.e., not usable. The write-update protocol needs the new block content B' to be broadcast to all cache copies using the bus as shown in fig. 4.11 (c). The memory copy is also updated if write-through caches are used. The memory copy is updated later, at block replacement time, while using write-back caches.

(a) Consistent Copies of the Block before the Write (b) After a Write-invalidate Operation by Processor 1 (c) After a Write-update Operation by Processor 1

Fig. 4.11 Write-invalidate and Write-update Coherence Protocols for Write-through Caches
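The different results of the two protocols can be reproduced with a few lines of code. This is a minimal sketch for one block: the function name `snoopy_write` and the use of `None` for an invalid or missing copy are assumptions of the example.

```python
def snoopy_write(caches, writer, new_value, protocol):
    """Return the cache contents after `writer` writes `new_value` to the block.

    caches holds each cache's copy of the block, or None if it has no valid copy.
    """
    result = list(caches)
    result[writer] = new_value
    for q, copy in enumerate(caches):
        if q == writer or copy is None:
            continue                          # no copy to invalidate or update
        if protocol == "write-invalidate":
            result[q] = None                  # remote copy is invalidated
        else:
            result[q] = new_value             # new content is broadcast
    return result


before = ["B", "B", "B"]
print(snoopy_write(before, 0, "B'", "write-invalidate"))  # ["B'", None, None]
print(snoopy_write(before, 0, "B'", "write-update"))      # ["B'", "B'", "B'"]
```

With write-through caches the memory copy would become B' in both cases; the sketch models only the cache copies.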


Q.15. Discuss performance issues of snoopy protocol.

Ans. Snoopy protocol performance relies heavily on the workload patterns and implementation efficiency. The key motivation for employing the snooping mechanism is to decrease the bus traffic, with a secondary purpose of decreasing the effective memory-access time. In the write-invalidate protocol the block size is very sensitive to cache performance, although not in the write-update protocol.

Bus traffic and memory access time are mainly contributed by cache misses for a uniprocessor system. The miss ratio reduces as the block size increases. However, the miss ratio begins to increase when the block size increases to a data pollution point. The data pollution point appears at a larger block size for larger caches.

The write-invalidate protocol will perform better for a system that requires extensive process migration or synchronization. However, a cache miss can result from an invalidation initiated by another processor before the cache access. These invalidation misses increase the bus traffic and hence should be reduced.

Extensive simulation results have argued that bus traffic in a multiprocessor may increase as the block size increases. The implementation of synchronization primitives is made easy by write-invalidate.

A bus broadcast capability is required by the write-update protocol. This protocol can also prevent the ping-pong effect on data shared between multiple caches. In a write-update multiprocessor, reducing the sharing of data will lessen bus traffic. However, write-update cannot be used with long write bursts. One can reveal the cache behaviour, hit ratio, bus traffic and effective memory-access time only through extensive program traces.


Q.16. Discuss the basic concept of directory-based cache coherence scheme.
Or
Discuss the directory based cache coherence protocol. (R.G.P.V., June 2016)

Ans. Cache coherence in a multistage network is supported by employing cache directories to store information on where copies of cache blocks reside. Directory-based protocols mainly differ in how the directory maintains information and what information it stores.

The first directory scheme was proposed by Tang in 1976, which used a central directory having duplicates of all cache directories. This central directory is large in size and must be associatively searched, like the individual cache directories. This central directory gives all the information required to enforce consistency. However, there are two drawbacks in using a central directory for a large multiprocessor - contention and long search times.

Censier and Feautrier in 1978 proposed a distributed-directory scheme. A separate directory is maintained by each memory module that records the state and presence information for each memory block. The state information is local, while the presence information specifies which caches contain a copy of the block.

A read miss in cache 2, shown by dashed lines in fig. 4.12, results in a request sent to the memory module. The request is retransmitted to the dirty copy in cache 1, which writes back its copy so that the memory module can supply a copy to the requesting cache. When there is a write hit at cache 1, shown by bold lines, a command is transferred to the memory controller, which sends invalidations to all caches, i.e. cache 2, marked in the presence vector residing in the directory D1.

Fig. 4.12 Directory-based Cache Coherence Scheme


A cache-coherence protocol must keep the locations of all cached copies of each block of shared data if it does not use a broadcast facility. This list of cached locations can be centralized or can be distributed. This list is known as the cache directory. For each block of data, a directory entry contains a number of pointers to mention the locations of copies of the block. Each directory entry also contains a dirty bit to mention whether a unique cache has permission to write the associated block of data.

Q.17. What are the different types of directory protocols ? Describe each of them.
Or
Explain about directory-based protocols. (R.G.P.V., Dec. 2014)

Ans. The types of directory protocols are divided into three main categories - full-map directories, limited directories and chained directories.

Full-map directories record, for each block in global memory, enough data that every cache can simultaneously hold a copy of any block of data in the system. It means that each directory entry has N pointers, where N represents the number of processors in the system. Unlike the full-map directories, limited directories contain a fixed number of pointers per entry, independent of the system size. Chained directories emulate the full-map schemes by distributing the directory among the caches.

(i) Full-map Directories - Full-map directories implement directory entries with one bit per processor and a dirty bit. Each bit denotes the status of the block in the corresponding processor’s cache (i.e., present or absent). Only one processor’s bit is set when the dirty bit is set, and that processor can write into the block.

Two bits of state per block are maintained by a cache. The first bit specifies whether a block is valid, and the second bit specifies whether a valid block may be written. The cache coherence protocol must keep the state bits in the memory directory and those in the cache consistent.

There are three different states of a full-map directory as shown in fig. 4.13. Fig. 4.13 (a) shows the first state in which location X is missing in all of the caches in the system. Fig. 4.13 (b) shows that the second state results from three caches C1, C2 and C3 requesting copies of location X. Three pointers, that is, processor bits, are set in the entry to specify the caches that contain copies of the data block. The dirty bit on the left side of the directory entry is set to clean (C) in the first two cases, showing that no processor has permission to write to the data block. Fig. 4.13 (c) shows that the third state results from cache C3 requesting write permission for the block. In this state, the dirty bit is set to dirty (D) and there is a single pointer to the data block in cache C3.


Fig. 4.13 Full-map Directory Protocol
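The three states of a full-map directory entry can be traced with a short model. The class `FullMapEntry` and its methods are hypothetical names for this sketch, caches are numbered from 0, and messages such as invalidate requests and acknowledgements are reduced to a returned list.

```python
class FullMapEntry:
    """Directory entry for one block: a dirty bit plus one present bit per cache."""

    def __init__(self, n_caches):
        self.dirty = False                    # clean (C)
        self.present = [False] * n_caches     # one pointer bit per cache

    def read(self, cache):
        """A cache requests a read copy of the block."""
        if self.dirty:
            self.dirty = False                # assumed: owner writes back, entry is clean again
        self.present[cache] = True

    def write(self, cache):
        """A cache requests write permission; returns the caches to invalidate."""
        to_invalidate = [c for c, bit in enumerate(self.present) if bit and c != cache]
        self.present = [c == cache for c in range(len(self.present))]
        self.dirty = True                     # dirty (D): single pointer to the writer
        return to_invalidate


entry = FullMapEntry(3)                       # state (a): no cached copies
for cache in (0, 1, 2):
    entry.read(cache)                         # state (b): three pointers set, clean
print(entry.dirty, entry.present)
print(entry.write(2))                         # caches 0 and 1 are invalidated
print(entry.dirty, entry.present)             # state (c): dirty, single pointer
```

The entry needs one bit per cache, which is the source of the memory overhead that makes the scheme hard to scale.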


voc neler the transition from the second state to the third state. When 
p ssor 2 issues the write to cache C, the following events will occur — 


riser: A aiin having location’X is valid is detected by cache 
-E : or does not contain permissi i f 
specified by the block’s write-permission bit. ak 


(b) A write request is iss 
having location X and stalls processor a ee hee Se 


(c) The invalid 
caches C, and-C3. 


ry module 


ate requests is issued by the memory module to 


Pp 
c 
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(d) Caches C; and C; after receiving the invalidate request 
set the appropriate bit to show that the block having location X is invalid i 


send acknowledgements back to the memory module. 
(e) After receiving the acknowledgements, the memory modul 
e 


sets the dirty bit, clears the pointers to caches C, and C3, and sends writ 
e 


permission to cache C). 


(f) The write permission message is received by cache C2. Then it updates the state in the cache, and reactivates processor 2.

The memory module waits to receive the acknowledgements before enabling processor 2 to complete its write transaction. The protocol guarantees that the memory system ensures sequential consistency by waiting for acknowledgements. The full-map protocol offers a useful upper bound for the performance of centralized directory-based cache coherence. However, it is not scalable because of its large memory overhead.

The memory consumed by the directory is proportional to the size of memory, O(N), multiplied by the size of the directory entry, O(N), because the size of the directory entry associated with each block of memory is proportional to the number of processors. Therefore, the total memory overhead scales as the square of the number of processors, O(N^2).

(ii) Limited Directories - The directory size problem can be solved by using limited directory protocols. The growth of the directory can be limited to a constant factor by restricting the number of simultaneously cached copies of any particular block of data.

A directory protocol is classified as Dir_i X, in which the symbol i represents the number of pointers, and X is NB for a scheme with no broadcast. Dir_i NB denotes a limited directory protocol that employs i < N pointers. Dir_N NB represents a full-map scheme without broadcast. The limited directory protocol works like the full-map directory protocol, except in the situation when more than i caches request read copies of a specific data block.

The situation when three caches request read copies in a memory system with a Dir_2 NB protocol is depicted in fig. 4.14. In this situation, the two pointers of the directory entry are already in use, so one of the existing copies has to be invalidated to free a pointer for the third cache.

Fig. 4.14 Limited Directory Protocol

If, in any given interval, only a small set of processors accesses a given memory word, then a limited directory is sufficient to capture this worker set of processors. Each pointer holds a processor identifier of log2 N bits, so the memory overhead of the limited directory scheme is O(N log2 N).
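The two overhead figures above can be compared numerically. The following is a minimal sketch, assuming that each of the N nodes holds the same number of memory blocks, that a full-map entry costs one presence bit per processor, and that a limited-directory pointer costs ceil(log2 N) bits. Dirty bits are ignored, and the function names are illustrative only.

```python
import math

def full_map_bits(n_procs, blocks_per_node):
    """Directory bits for a full-map scheme: N presence bits per block."""
    total_blocks = n_procs * blocks_per_node      # memory grows as O(N)
    return total_blocks * n_procs                 # entry size grows as O(N)

def limited_bits(n_procs, blocks_per_node, pointers):
    """Directory bits for a Dir_i NB scheme with a fixed number of pointers."""
    pointer_bits = max(1, math.ceil(math.log2(n_procs)))
    total_blocks = n_procs * blocks_per_node
    return total_blocks * pointers * pointer_bits  # O(N log2 N) overall

if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n, full_map_bits(n, 1024), limited_bits(n, 1024, pointers=4))
```

With these assumptions the full-map cost grows with the square of the processor count, while the four-pointer directory grows only slightly faster than linearly.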
(iii) Chained Directories - The scalability of limited directories is realized by chained directories without limiting the number of shared copies of data blocks. This type of cache coherence scheme is known as a chained scheme, since it keeps track of shared copies of data by managing a chain of directory pointers. This scheme implements a singly linked chain as shown in fig. 4.15.

Fig. 4.15 Chained Directory Protocol

Consider that there are no shared copies of location X. The memory sends a copy to cache C1 along with a chain termination (CT) pointer, when processor 1 reads location X. The memory also holds a pointer to cache C1. When processor 2 reads location X, the memory sends a copy to cache C2 along with the pointer to cache C1. Then the memory keeps a pointer to cache C2. All of the caches can cache a copy of the location X by repeating the above step. When processor 3 writes to location X, it is required to send a data invalidation message down the chain. The memory module denies processor 3 write permission until the processor with the chain termination pointer acknowledges the invalidation of the chain, in order to maintain sequential consistency. This scheme could be known as a gossip protocol, as information is passed from individual to individual instead of being spread by covert observation.

Chained-directory protocols are complicated by the possibility of cache block replacement. Assume that caches C1 through CN all have copies of location X and that location X and location Y map to the same (direct-mapped) cache line. When processor i reads location Y, it must first evict location X from its cache, with the following possibilities -

(a) Invalidate location X in cache C(i+1) through cache CN.
(b) Send a message down the chain to cache C(i-1), with a pointer to cache C(i+1), and splice Ci out of the chain.

The first scheme can be implemented by a less complex protocol in comparison to the second. Sequential consistency in either case is maintained by locking the memory location when invalidations are in progress. The use of a doubly linked chain is another method for the replacement problem. For each cached copy, this scheme uses forward and backward chain pointers. As a result, when there is a cache replacement, the protocol does not have to traverse the chain. The replacement condition is optimized by the doubly linked directory at the cost of a larger average message block size, a more complex coherence protocol, and double the pointer memory in the caches.

The chained protocols are more complex in comparison to the limited directory protocols. However, the chained protocols are still scalable in the sense of the amount of memory utilized for the directories. The number of pointers per cache or memory block is independent of the number of processors, while the pointer size increases as the logarithm of the number of processors.
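The singly linked chain described above can be sketched in a few lines. This is a simplified illustration for a single memory block, assuming that cache identifiers are integers and that a write simply invalidates every other copy on the chain; the class and method names are hypothetical, and message passing, acknowledgements and locking are not modelled.

```python
class ChainedDirectory:
    """Singly linked chain of sharers for one memory block (location X)."""

    CT = None  # chain termination pointer

    def __init__(self):
        self.head = self.CT      # pointer kept by the memory module
        self.next_ptr = {}       # cache id -> pointer to the next cache in the chain

    def read(self, cache_id):
        """A read miss: the new reader becomes the head of the chain."""
        if cache_id in self.next_ptr:
            return               # already holds a copy
        self.next_ptr[cache_id] = self.head
        self.head = cache_id

    def sharers(self):
        """Walk the chain from the memory pointer to the CT pointer."""
        chain, node = [], self.head
        while node is not self.CT:
            chain.append(node)
            node = self.next_ptr[node]
        return chain

    def write(self, cache_id):
        """A write: invalidate every other copy, then keep a single owner."""
        invalidated = [c for c in self.sharers() if c != cache_id]
        self.next_ptr = {cache_id: self.CT}
        self.head = cache_id
        return invalidated


if __name__ == "__main__":
    d = ChainedDirectory()
    d.read(1)
    d.read(2)
    print(d.sharers())   # chain order, most recent reader first
    print(d.write(3))    # caches that receive the invalidation
```

Note that the memory module stores only one pointer per block, whatever the number of sharers, which is why the directory memory stays scalable.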


Q.18. Discuss and compare the merits and demerits of snoopy bus protocols and directory-based protocols. (R.G.P.V., Dec. 2010)
Or
List two approaches to cache coherence protocol. (R.G.P.V., June 2016)

Ans. Refer to Q.14 and Q.16.

Q.19. What is meant by cache coherence problems ? Describe various protocols for cache coherence. (R.G.P.V., June 2015)

Ans. Refer to Q.1, Q.14 and Q.16.

Q.20. What are the different message-routing schemes ?
Or
Explain message routing schemes in multicomputer network. (R.G.P.V., Dec. 2010, June 2011, 2013)
Or
Discuss message routing schemes for multicomputer network. (R.G.P.V., June 2012, Dec. 2014)
Or
Describe the message routing schemes in multicomputer network. (R.G.P.V., Dec. 2015)

Ans. There are two different message routing schemes -

(i) Store-and-forward Routing - In a store-and-forward network, packets are the basic unit of information flow, as shown in fig. 4.16. Each node uses a packet buffer. A packet is sent from a source node to a destination node via a sequence of intermediate nodes. A packet is first stored in the buffer when it reaches an intermediate node. Afterwards it is forwarded to the next node if the desired output channel and a packet buffer in the receiving node are both available.

Fig. 4.16 Store-and-forward Routing using Packet Buffers in Successive Nodes

Store-and-forward routing was implemented in the first generation of multicomputers. In store-and-forward networks, the latency is directly proportional to the distance between the source and the destination.

(ii) Wormhole Routing - Newer multicomputers implement the wormhole routing scheme, by partitioning the packet into smaller flits, as shown in fig. 4.17. Flit buffers are used in the hardware routers attached to nodes. The transmission from the source node to the destination node is performed via a sequence of routers.

Fig. 4.17 Wormhole Routing using Flit Buffers in Successive Routers

Within a packet, all the flits are transmitted in order as inseparable companions, in a pipelined fashion. The packet can be viewed as a train with an engine car (i.e., the header flit) towing a long sequence of box cars (i.e., data flits). Only the header flit knows where the train (packet) is going. The header flit must be followed by all the data flits (i.e., box cars). During transmission, different packets can be interleaved. However, the flits from two different packets cannot be mixed up. Otherwise they may be towed to the wrong destinations. The latency in wormhole routing is almost independent of the distance between the source and the destination.

Q.21. Give the time comparison between store-and-forward and wormhole-routed networks.
Or
Discuss the latency analysis between store-and-forward and wormhole-routed networks.


Ans. Fig. 4.18 shows a time comparison between store-and-forward and wormhole-routed networks. Let L be the length of the packet in bits, W be the channel bandwidth in bits/s, D be the distance and F be the length of a flit in bits.

For a store-and-forward network, the communication latency T_SF is represented as -

$T_{SF} = \frac{L}{W}(D + 1)$ ...(i)

For a wormhole-routed network, the communication latency T_WH is represented as -

$T_{WH} = \frac{L}{W} + \frac{F}{W} \times D$ ...(ii)

Equation (i) indicates that T_SF is directly proportional to D. In equation (ii), T_WH ≈ L/W when L >> F. Hence, the distance D has a negligible effect on the routing latency.

Fig. 4.18 Time Comparison between the Two Routing Techniques: (a) Store-and-forward Routing, (b) Wormhole Routing

We have neglected the network startup latency and the blocking time caused by resource shortages, such as buffers being full or channels being busy. The channel propagation delay has also been neglected, as it is much smaller than the terms in T_SF or T_WH.
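Equations (i) and (ii) can be evaluated directly to see how weakly the wormhole latency depends on distance. The sketch below is only an illustration; the function names and the sample values of L, W, F and D are assumptions chosen for the example, not figures from the text.

```python
def store_and_forward_latency(L, W, D):
    """Equation (i): T_SF = (L / W) * (D + 1)."""
    return (L / W) * (D + 1)

def wormhole_latency(L, W, D, F):
    """Equation (ii): T_WH = L / W + (F / W) * D."""
    return L / W + (F / W) * D

if __name__ == "__main__":
    L, W, F = 1024, 1024, 32          # bits, bits/s, bits (sample values)
    for D in (1, 4, 16):
        print(D, store_and_forward_latency(L, W, D), wormhole_latency(L, W, D, F))
```

As D grows, the store-and-forward latency grows in proportion to D + 1, while the wormhole latency changes only by the small term (F/W) × D.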


Q.22. Explain message format used as information units of communication in a message passing network.

Ans. Fig. 4.19 shows the information units used in message routing. A message is the logical unit for internode communication. A message may have a variable length because it is assembled from an arbitrary number of fixed-length packets. A packet is the basic unit containing the destination address for routing purposes. A sequence number is needed in each packet to allow reassembly of the transmitted message, because different packets may arrive at the destination asynchronously.

A packet can be further partitioned into a number of fixed-length flits (i.e., flow control digits). Sequence number and routing information occupy the header flits. The remaining flits are the data elements of a packet. Packets are the smallest unit of information transmission in multicomputers with store-and-forward routing.

Fig. 4.19 The Format of Message, Packets, and Flits (R : Routing Information, S : Sequence Number, D : Data Only Flits)

On the other hand, in wormhole-routed networks, packets are further subdivided into flits. A 256-node network needs 8 bits per flit. The length of a flit is often affected by the network size. The routing scheme and network implementation are used to determine the packet length. Typically, the packet lengths are between 64 bits and 512 bits. On the basis of message length, the sequence number may take one or two flits. Other factors that need to be considered for the selection of flit and packet sizes are channel bandwidth, network traffic intensity, router design, etc.

Q.23. How asynchronous pipelining is done in flits in a packet ?

Ans. Fig. 4.20 illustrates that the pipelining of successive flits in a packet is performed asynchronously using a handshaking protocol. A 1-bit ready/request (R/A) line is used between two adjacent routers along the path.

Fig. 4.20 (a) shows the case when the receiving router Y is ready to receive a flit (i.e., the flit buffer is available); it pulls the R/A line low. When the sending router X is ready to send a flit, it raises the R/A line high and transmits flit i using the channel, as shown in fig. 4.20 (b).

Fig. 4.20 Handshaking Protocol between Two Wormhole Routers: (a) Y is Ready to Receive a Flit, (b) X is Ready to Send Flit i, (c) Flit i is Received by Y, (d) Flit i is Removed from Y's Buffer and Flit i + 1 Arrives at X's Buffer

Fig. 4.20 (c) shows the case when the flit is being received by Y; the R/A line is kept high. After flit i is removed from Y's buffer, i.e., is transmitted to the next node, the cycle repeats itself for the transmission of the next flit i + 1, until the entire packet has been transmitted, as shown in fig. 4.20 (d).

The clock used in asynchronous pipelining can be faster than the one used in a synchronous pipeline. Asynchronous pipelining can be very efficient. However, when flit buffers or successive channels along the path are not available during specific cycles, the pipeline can be stalled.

Q.24. Explain store-and-forward routing, wormhole routing and its handshaking protocol associated with message-passing mechanism. (R.G.P.V., May 2018)

Ans. Store-and-forward Routing - Refer to Q.20 (i).
Wormhole Routing - Refer to Q.20 (ii).
Handshaking Protocol Associated with Message Passing - Refer to Q.23.

Q.25. Explain the flow-control strategies for resolving packet collision.


Ans. There are four methods to resolve the collision between two packets requesting the use of the same outgoing channel at an intermediate node. In each case, packet 1 is allocated the channel and packet 2 is denied.

In case of packet collision, pure wormhole routing uses a blocking policy, as shown in fig. 4.21 (a). The second packet is blocked from advancing, although it is not abandoned.

The second policy is called the discard policy, which simply drops the packet being blocked from passing through, as depicted in fig. 4.21 (b).

Fig. 4.21 (c) shows the third policy, known as detour. The packet that is blocked is sent to a detour channel.

The blocking policy is economical to implement. However, this policy leads to the idling of resources that are assigned to the blocked packet.

In packet routing, detour routing provides more flexibility. However, the detour may waste more channel resources than required to reach the destination. Moreover, a detoured packet may enter a cycle of livelock, which wastes network resources. This detour policy has been used by the Connection Machine.

Fig. 4.21 Flow Control Methods: (a) Blocking Flow Control, (b) Discard and Retransmission, (c) Detour after being Blocked, (d) Buffering in Virtual Cut-through Routing

A buffering method devised by Kleinrock and Kermani has been proposed with the virtual cut-through routing scheme. In this scheme, packet 2 is temporarily stored in a packet buffer, as shown in fig. 4.21 (d). This packet will be transmitted when the channel becomes free. This buffering method has the benefit of not wasting the resources already assigned. However, it demands the use of a large buffer to hold the complete packet. Moreover, the packet buffers along the communication path should not make a cycle. The packet buffer cannot be built into a router chip. It may require the use of local memory to implement the packet buffer, which may result in significant storage delay. By combining the store-and-forward and wormhole routing schemes, the virtual cut-through method gives a compromise. This scheme will behave like a store-and-forward network in the worst case. However, the scheme should perform as well as wormhole routing when no collision takes place.

Q.26. Discuss the various situations causing deadlock.
Or
What is deadlock ? (R.G.P.V., Dec. 2015)
Or
Write short note on deadlock. (R.G.P.V., Dec. 2017)

Ans. There are two types of deadlock situations, caused by a circular wait at buffers or at communication channels. A buffer deadlock for a store-and-forward network is illustrated in fig. 4.22 (a). A circular wait situation results from four packets occupying four buffers in four nodes. The deadlock cannot be broken, unless one packet is misrouted or discarded. A channel deadlock results from four messages being simultaneously transmitted along four channels in a mesh-connected network using wormhole routing, as illustrated in fig. 4.22 (b).

Fig. 4.22 Deadlock Situations: (a) Buffer Deadlock with Store-and-forward Routing, (b) Channel Deadlock with Wormhole Routing

Q.27. Discuss the concept of virtual channels.
Or
Define virtual channel. Explain what is its need. (R.G.P.V., June 2012)

Ans. In a wormhole-routed multicomputer network, the communication channels between nodes are actually shared by a number of possible source and destination pairs. The concept of virtual channels arises from this sharing of a physical channel. A virtual channel is a logical link between two nodes. It is built from a flit buffer in the source node, a flit buffer in the receiver node and a physical channel between them.

Four virtual channels sharing a single physical channel with time multiplexing on a flit-by-flit basis are illustrated in fig. 4.23. The figure shows that four flit buffers are used at the source node and at the receiver node, respectively. One source buffer is paired with one receiver buffer to make a virtual channel when the physical channel is assigned to the pair.

In other words, the physical channel is time-shared by all the virtual channels. Apart from the buffers and channel involved, some channel states must be identified with the different virtual channels. The source buffers hold flits awaiting use of the channel. The receiver buffers hold flits just transmitted over the channel. The channel gives a communication medium between them.

Fig. 4.23 Concept of Virtual Channels

To implement the virtual channels, a multiplexer, a demultiplexer and crossbar-switch control are required.


Q.28. Explain deadlock and virtual channel in multicomputer networks. (R.G.P.V., Dec. 2010)

Ans. Refer to Q.26 and Q.27.

Q.29. Explain how the virtual channels can be used for deadlock avoidance ? (R.G.P.V., Dec. 2007)

Ans. The deadlock cycle can be broken by using two virtual channels, V3 and V4, as shown in fig. 4.24 (c). A modified channel-dependence graph is obtained by employing the virtual channels V3 and V4 after the use of channel C2, instead of using C3 and C4 again. The cycle in fig. 4.24 (b) is changed to a spiral, hence preventing a deadlock. When the packet length is sufficiently small, the channel multiplexing can be done at the packet level or at the flit level.

The implementation of virtual channels can be performed with either unidirectional channels or bidirectional channels. When two unidirectional channels are combined into a single bidirectional channel, the utilization rate increases and the channel bandwidth doubles. However, arbitration is more involved in a bidirectional channel. A special arbitration line is required between adjacent nodes interconnected by a bidirectional channel. This line is used to control the direction of information flow. Practically, bidirectional channels may introduce more delay because of direction arbitration, and greater costs because of increased control complexity, as compared with unidirectional channels. A bidirectional channel may be more efficient when the network traffic is light.

Fig. 4.24 Deadlock Avoidance using Virtual Channels to Convert a Cycle to a Spiral on a Channel-dependence Graph: (b) Channel-dependence Graph Containing a Cycle, (c) Adding Two Virtual Channels (V3, V4), (d) A Modified Channel-dependence Graph using the Virtual Channels

The effective channel bandwidth available to each request may be decreased by using the virtual channels. There exists a tradeoff between network throughput and communication latency in determining the degree of using virtual channels. To implement a large number of virtual channels, high-speed multiplexing is needed.
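Whether a set of channel dependences can deadlock comes down to whether the channel-dependence graph contains a cycle. The sketch below checks this with a depth-first search. The two small graphs are illustrative assumptions in the spirit of fig. 4.24: four channels C1 to C4 forming a ring, and a modified graph in which packets leaving C2 continue on virtual channels V3 and V4. They are not a transcription of the figure.

```python
def has_cycle(graph):
    """Return True if the directed graph (node -> list of successors) has a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY                      # node is on the current DFS path
        for succ in graph.get(node, []):
            state = colour.get(succ, WHITE)
            if state == GREY:                    # back edge: a circular wait exists
                return True
            if state == WHITE and visit(succ):
                return True
        colour[node] = BLACK                     # fully explored, no cycle through it
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in graph)


# Assumed dependence graph with a circular wait among four physical channels.
ring = {"C1": ["C2"], "C2": ["C3"], "C3": ["C4"], "C4": ["C1"]}

# Assumed modified graph: after C2, packets use virtual channels V3 and V4.
spiral = {"C3": ["C4"], "C4": ["C1"], "C1": ["C2"],
          "C2": ["V3"], "V3": ["V4"], "V4": []}

if __name__ == "__main__":
    print(has_cycle(ring), has_cycle(spiral))
```

In the modified graph the chain of dependences ends at V4 instead of closing on itself, which is the "spiral" property that rules out a circular wait.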

Q.30. Explain virtual networks and subnetworks. (R.G.P.V., June 2010)

Ans. Fig. 4.25 (a) depicts a mesh with dual virtual channels along both dimensions. The virtual channels may be used to produce four possible virtual networks. Fig. 4.25 (b) shows the virtual network used for east-north traffic.

In the same way, we can construct three other virtual networks for other traffic orientations. Note that no cycle is possible on any of the virtual networks. Hence, deadlock can be completely prevented when X-Y routing is implemented on these virtual networks.

Any two of the four virtual networks can be used simultaneously without conflict when both pairs between adjacent nodes are physical channels. If only one pair of physical channels is shared by the dual virtual channels between adjacent nodes, then only (b) and (e) or (c) and (d) can be used at a time.

Different combinations, such as (b) and (c), or (b) and (d), or (c) and (e), or (d) and (e), cannot coexist simultaneously because of a shortage of channels.

Fig. 4.25 Four Virtual Networks Implementable from a Dual-channel Mesh: (a) A Dual-channel 4 x 4 Mesh, (b) East-north Subnet, (c) East-south Subnet, (d) West-north Subnet, (e) West-south Subnet

Adding channels to the network will increase the adaptivity in making routing decisions. However, the increased cost can be appreciable and hence the use of redundancy is avoided.

For multicast communications, the concept of virtual networks results in the partitioning of a given physical network into logical subnetworks. Fig. 4.26 shows the partitioning of a 7 x 9 mesh into four subnets for a multicast communication from source node (4, 3). Shaded nodes are along the boundary of adjacent subnets.

Assume source node (4, 3) wishes to transmit to a subset of nodes in the 7 x 9 mesh. The mesh is divided into four logical subnets. All traffic heading for east and north uses the subnet at the upper right corner. In the same way, we can construct three other subnets at the other three corners of the mesh.

Nodes in the third row and fifth column are along the boundary between subnets. In fact, the traffic is being directed outward from the center node (4, 3). In this partitioned mesh, there is no deadlock when an X-Y multicast is performed.

Likewise, one can divide a binary n-cube into 2^(n-1) subcubes to give deadlock-free adaptive routing. For the bidirectional network, each subcube has n + 1 levels with 2^n virtual channels per level. It has been observed that for low-dimensional cubes (n = 3 to 4), this method is best for general-purpose routing. The number of required virtual channels increases rapidly with the value of n.

Fig. 4.26 Partitioning of a 7 x 9 Mesh

Q.31. Explain the following terms associated with multicomputer networks and message passing mechanisms -
(i) Message, packets and flits
(ii) Store and forward routing at packet level
(iii) Wormhole routing at flit level
(iv) Buffer deadlock versus channel deadlock
(v) Virtual channels versus physical channels
(vi) Blocking flow control in wormhole routing
(vii) Buffering flow control using virtual cut-through routing. (R.G.P.V., June 2010)

Ans. (i) Message, Packets and Flits - Refer to Q.22.
(ii) Store and Forward Routing at Packet Level - Refer to Q.20.
(iii) Wormhole Routing at Flit Level - Refer to Q.20.
(iv) Buffer Deadlock versus Channel Deadlock - Refer to Q.26.
(v) Virtual Channels versus Physical Channels - Refer to Q.27.
(vi) Blocking Flow Control in Wormhole Routing - Refer to Q.25.
(vii) Buffering Flow Control using Virtual Cut-through Routing - Refer to Q.25.


Q.32. What is vector processing ? Give some examples of vector processing. Also, discuss some primitive vector processing instructions. (R.G.P.V., June 2006, Dec. 2016)
Or
Write vector processing principles. (R.G.P.V., June 2012)

Ans. A vector processor or vector computer is a machine designed to efficiently control arithmetic operations on elements of arrays, known as vectors. This type of machine is especially useful in high-performance scientific computing, where matrix and vector arithmetic are quite common.

A vector operand has an ordered set of n elements, in which n denotes the length of the vector. In a vector, each element is a scalar quantity, that may be a character, a logical value, an integer or a floating point number.


Vector instructions are categorized into the following four primitive types -

f1 : V → V
f2 : V → S
f3 : V × V → V
f4 : V × S → V

in which V is a vector operand and S is a scalar operand. The mappings f1 and f2 are unary operations and f3 and f4 are binary operations. Some representative vector operations that can be found in a modern vector processor are listed in table 4.1.

Table 4.1
Type    Operation               Description
f1      Vector square root      B(I) ← sqrt(A(I))
f1      Vector sine             B(I) ← sin(A(I))
f1      Vector complement       A(I) ← complement of A(I)
f2      Vector summation        S = sum of A(I) for I = 1 to N
f2      Vector maximum          S = max of A(I) for I = 1 to N
f3      Vector add              C(I) = A(I) + B(I)
f3      Vector multiply         C(I) = A(I) * B(I)
f3      Vector and              C(I) = A(I) and B(I)
f3      Vector larger           C(I) = max (A(I), B(I))
f3      Vector test >           C(I) = 0 if A(I) < B(I)
                                C(I) = 1 if A(I) > B(I)
f4      Vector-scalar add       B(I) = S + A(I)
f4      Vector-scalar divide    B(I) = A(I)/S

To ease the manipulation of vector data, some special instructions may be used. By comparing two vectors, a boolean vector may be produced which can be used as a masking vector for enabling or disabling component operations in a vector instruction. A compress instruction will shorten a vector under the control of a masking vector. A merge instruction is used to combine two vectors under the control of a masking vector. Compress and merge are special f1 and f3 operations, since the resulting operand may have a length distinct from that of the input operands. Fig. 4.27 shows the pipelined implementation of the four basic vector operations.

Fig. 4.27 Four Vector Instruction Types for Pipelined Processor: (a) f1 : V1 → V2, (b) f2 : V1 → S, (c) f3 : V1 × V2 → V3, (d) f4 : S × V1 → V2

To characterize these special vector operations, some examples are given below -

Examples - (i) Assume P = (4, 2, 8, 6) and Q = (3, 5, 7, 9). The boolean vector B = (1, 0, 1, 0) is produced after the execution of the compare operation B = P > Q.

(ii) Assume P = (1, 2, 3, 4, 5, 6, 7, 8, 9) and B = (0, 1, 1, 0, 1, 0, 0, 1, 0). The compressed vector Q = (2, 3, 5, 8) is produced after the compress operation Q = P (B) is executed.

(iii) Assume P = (1, 4, 6, 8), Q = (2, 3, 5, 7, 9) and B = (1, 0, 0, 1, 0, 1, 0, 1, 0). The merged vector M = (1, 2, 3, 4, 5, 6, 7, 8, 9) is produced after the merge operation M = P, Q, (B). The first 1 in B specifies that M (1) is chosen from the first element of P. Likewise, the first 0 in B specifies that M (2) is chosen from the first element of Q.
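The three examples above can be reproduced with a short sketch. Plain Python lists stand in for vector registers, and the function names `compare_gt`, `compress` and `merge` are illustrative choices rather than instruction mnemonics of any particular machine.

```python
def compare_gt(p, q):
    """Boolean (0/1) vector B with B(I) = 1 where P(I) > Q(I)."""
    return [1 if a > b else 0 for a, b in zip(p, q)]

def compress(p, mask):
    """Keep only the elements of P whose mask bit is 1."""
    return [x for x, m in zip(p, mask) if m]

def merge(p, q, mask):
    """Take the next element of P where the mask bit is 1, else the next of Q."""
    p_iter, q_iter = iter(p), iter(q)
    return [next(p_iter) if m else next(q_iter) for m in mask]

if __name__ == "__main__":
    print(compare_gt([4, 2, 8, 6], [3, 5, 7, 9]))
    print(compress([1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 1, 0, 1, 0, 0, 1, 0]))
    print(merge([1, 4, 6, 8], [2, 3, 5, 7, 9], [1, 0, 0, 1, 0, 1, 0, 1, 0]))
```

The merge result has as many elements as the mask, which illustrates why the output length can differ from the lengths of the input operands.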


Q.33. What is vector processing ? Explain the characteristics of vector processing. (R.G.P.V., June 2011)

Ans. Vector Processing - Refer to Q.32.

Generally, machine operations suitable for pipelining should have the following three characteristics -

(i) Operations that are executed by different pipelines should be able to share expensive resources, like memories and buses, in the system.

(ii) Identical processes (or functions) are repeatedly called a number of times, each of which can be subdivided into subprocesses (or subfunctions).

(iii) Successive operands are fed through the pipeline segments and require as few buffers and local controls as possible.

The above characteristics explain why most vector processors have pipeline structures. Vector instructions are required to carry out the same operation on various data sets repeatedly. This is not true for scalar processing over a single pair of operands. An advantage of vector processing over scalar processing is the reduction of the overhead due to the loop-control mechanism. A vector processor should perform better with longer vectors because of the startup delay in a pipeline.

Q.34. Explain the various pipeline vector processing methods. (R.G.P.V., June 2016)

Ans. There are three types of pipeline vector processing methods as follows -

(i) Horizontal Vector Processing Method - In this method, all components of the vector y are calculated in a sequential manner, $y_i$ for $i = 1, 2, 3, \ldots, m$. Each summation $y_i = \sum_{j=1}^{n} z_{ij}$, including (n - 1) additions, must be completed before switching to the evaluation of the next summation $y_{i+1} = \sum_{j=1}^{n} z_{i+1,j}$, where $z_{ij} = a_j x_{ij}$ is the product term used in the program below. (n + 14)t clock periods are needed to evaluate each $y_i$. The total add time for m outputs is given below -

$T_a(\text{horizontal}) = (mn + 14m)\,t$ ...(i)

The horizontal vector processing method is frequently employed in a scalar pipeline processor. The above sequence of computations is equivalent to the following Fortran program, given that all initial values of $y_i$ for $i = 1, 2, 3, \ldots, m$ are set to zero -

      DO 100 i = 1, m
      DO 100 j = 1, n
      y(i) = y(i) + a(j) * x(i, j)
  100 CONTINUE

On a vector processor, the speedup of this horizontal pipelining over serial processing in a uniprocessor is given as -

$S_{\text{horizontal}} = \frac{T_1}{T_m + T_a(\text{horizontal})} = \frac{t(10mn - 5m)}{t(2mn + 14m + 9)} = \frac{10mn - 5m}{2mn + 14m + 9}$ ...(ii)

where $T_1$ is the serial processing time and $T_m$ is the multiply time.

(ii) Vertical Vector Processing Method - The sequence of additions with respect to the m-by-n array is given below -

Step I - Through the pipeline, calculate the partial additions $(z_{i1} + z_{i2}) = y_{i2}$ for $i = 1, 2, 3, \ldots, m$, sequentially.

Step II - By loading $y_{i2}$ from step I into one input port and loading $z_{i3}$ into the other input port, calculate the partial additions $(y_{i2} + z_{i3})$ for $i = 1, 2, 3, \ldots, m$.

Step III to Step (n - 1) - By feeding successive columns $(z_{1j}, z_{2j}, \ldots, z_{mj})^T$ for $j = 4, 5, 6, \ldots, n$ into the second input port, repeat step II for n - 3 times. At the end of step (n - 1), the values of $y_i$ for $i = 1, 2, 3, \ldots, m$ emerge from the pipeline.

The total add time of this vertical method is given as -

$T_a(\text{vertical}) = (mn - m + 10)\,t$ ...(iii)

Hence, the speedup of vertical vector processing over uniprocessing is given as -

$S_{\text{vertical}} = \frac{T_1}{T_a(\text{vertical}) + T_m} = \frac{10mn - 5m}{2mn - m + 19}$ ...(iv)

In the STAR-100, this method has been applied to vector processing.

(iii) Vector Looping Method - The vector looping method combines the vertical and horizontal vector processing methods into a block method. The steps are given below -

Step I - The vertical processing method is applied to produce the first block of five outputs, $y_1, y_2, y_3, y_4, y_5$, in column order.

Step II to Step k - Repeat step I for producing the remaining blocks as given below -
Step II - $y_6, y_7, \ldots, y_{10}$
Step III - $y_{11}, y_{12}, \ldots, y_{15}$
...
Step k - $y_{5k-4}, y_{5k-3}, \ldots, y_{5k}$

Step k + 1 - Repeat step I for producing the last block of r outputs $y_{5k+1}, y_{5k+2}, \ldots, y_{5k+r}$, where m = 5k + r and 0 < r < 5.

The total add time of this approach, where $k = \frac{m - r}{5}$, is -

$T_a(\text{vector looping}) = 5t + (5n - 1)t + (k - 1)[5(n - 1)t] + 5t = mnt - mt - nrt + 14t + rt$ ...(v)

The speedup of the vector-looping method over a uniprocessing method is -

$S_{\text{vector looping}} = \frac{T_1}{T_m + T_a(\text{loop})} = \frac{10mn - 5m}{2mn - m - nr + r + 23}$ ...(vi)

For segmented vector processing, this method has been applied in the Cray-1.
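The speedup expressions (ii), (iv) and (vi) can be compared numerically. The sketch below works in units of the clock period t. The multiply time of mn + 9 is not stated explicitly in the text; it is inferred here from the denominators of the speedup formulas and should be treated as an assumption, as should the function names.

```python
def serial_time(m, n):
    """T_1 in units of t, taken from the numerator of the speedup formulas."""
    return 10 * m * n - 5 * m

def multiply_time(m, n):
    """T_m in units of t (inferred from the denominators, an assumption)."""
    return m * n + 9

def add_time(method, m, n):
    """T_a in units of t, from equations (i), (iii) and (v)."""
    if method == "horizontal":
        return m * n + 14 * m
    if method == "vertical":
        return m * n - m + 10
    if method == "looping":
        r = m % 5                     # m = 5k + r; the text assumes 0 < r < 5
        return m * n - m - n * r + r + 14
    raise ValueError(method)

def speedup(method, m, n):
    return serial_time(m, n) / (multiply_time(m, n) + add_time(method, m, n))

if __name__ == "__main__":
    m, n = 12, 10
    for method in ("horizontal", "vertical", "looping"):
        print(method, round(speedup(method, m, n), 3))
```

For the sample sizes chosen here the vector looping method gives the largest speedup and the horizontal method the smallest, in line with the smaller add time of the block approach.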


Q.35. Explain vector and scalar balance point related to vector processing.
(R.G.P.V., Dec. 2014)

Ans. The vector and scalar balance point is used in developing future high-performance supercomputers, where a good vector/scalar balance must be maintained. In a supercomputer, separate hardware resources with different speeds are dedicated to concurrent vector and scalar operations. Scalar processing is indispensable for general-purpose architectures, whereas vector processing is needed for regularly structured parallelism in scientific and engineering computations. These two types of computations should be balanced.
The percentage of vector code in a program that is required to achieve equal utilization of vector and scalar hardware is known as the vector balance point. In other words, equal time is spent in vector and scalar hardware so that no resources will be idle. With the help of the vector/scalar balance point, supercomputers are targeted towards large-scale scientific and engineering problems. In addition, supercomputers must deliver high performance and must be programmable and accessible in a multiuser environment. The architecture of a supercomputer is driven by the highest performance in a variety of respects, including processor, memory, I/O performance, capacities and bandwidths in all subsystems.
For example, if a system is capable of 9 megaflops in vector mode and 1 megaflops in scalar mode, then equal time will be spent in each mode if 90% of the code is vector and 10% of the code is scalar, and the resulting vector/scalar balance point will be 0.9. It may not be optimal for a system to spend equal time in vector and scalar modes. However, the vector balance point should be maintained sufficiently high, matching the level of vectorization in user programs.
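The 0.9 figure in the example follows from equating the time spent in each mode. If a fraction f of the operations is vector code, the vector time is proportional to f / R_v and the scalar time to (1 - f) / R_s, so equal time gives f = R_v / (R_v + R_s). A minimal sketch of this calculation, with a hypothetical function name:

```python
def vector_balance_point(vector_mflops: float, scalar_mflops: float) -> float:
    """Fraction of vector code at which vector and scalar hardware
    are busy for the same amount of time."""
    return vector_mflops / (vector_mflops + scalar_mflops)


if __name__ == "__main__":
    # The example from the text: 9 megaflops vector, 1 megaflops scalar.
    print(vector_balance_point(9.0, 1.0))
```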


Q.36. Explain the terms gather, scatter and masking instructions related to vector processing.
(R.G.P.V., June 2017)

Ans. (i) Gather Instruction - These instructions randomly collect the vector elements from the memory. A gather uses two vector registers for gathering vector elements. The gather function is

$$f : M \rightarrow VR_1 \times VR_0$$

Here M = memory, VR = vector register.
It fetches vector elements from memory to vector registers.

(ii) Scatter Instruction - These instructions randomly spread the vector elements all over the memory. A scatter uses two vector registers for scattering vector elements. The scatter function is

$$f : VR_1 \times VR_0 \rightarrow M$$

In this, $VR_1$ contains the data and $VR_0$ contains the index of memory from which gathering or scattering is done.

(iii) Masking Instructions - To increase or decrease the size of a vector, masking is used, which converts a vector into a larger or smaller indexed vector. The masking function is

$$f : VR_0 \times VR_m \rightarrow VR_1$$

Here, $VR_m$ is the mask vector, which is used to mask $VR_0$ into $VR_1$.
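The three operations can be sketched in software as follows. This is only an illustration, not the instruction semantics of any particular machine: memory and registers are modelled as Python lists, the base-address parameter is an assumption, and masking is shown in its compressing form (keeping the elements whose mask bit is set).

```python
def gather(memory, base, vr0):
    """Collect memory[base + index] for every index in VR0 into VR1."""
    return [memory[base + i] for i in vr0]


def scatter(memory, base, vr0, vr1):
    """Store each element of VR1 at memory[base + index from VR0]."""
    for i, value in zip(vr0, vr1):
        memory[base + i] = value


def mask_compress(vr0, vrm):
    """Keep only the elements of VR0 whose mask bit in VRm is 1."""
    return [value for value, bit in zip(vr0, vrm) if bit]


if __name__ == "__main__":
    memory = list(range(100, 110))
    print(gather(memory, 0, [4, 2, 7]))
    scatter(memory, 0, [1, 3], [-1, -3])
    print(memory)
    print(mask_compress([5, 0, 7, 0], [1, 0, 1, 0]))
```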
Q.37. How many types of vector instruction are there ?
(R.G.P.V., June 2016)

Ans. Refer to Q.32 and Q.36.

Q.38. Define vector processing and its instruction types. Also, explain gather, scatter and masking instructions in Cray microprocessor.

Ans. Refer to Q.32 and Q.36.


Q.39. Write short note on vector operand specification.

Ans. Vector operands may have arbitrary length. Vector elements are not necessarily kept in contiguous memory locations. For instance, the entries in a matrix may be kept in row-major order or in column-major order. Each row, column or diagonal of the matrix can be used as a vector.

When row elements are kept in contiguous locations with a unit stride, the column elements must be kept with a stride of n, where n represents the matrix order. Likewise, the diagonal elements are separated by a stride of n + 1.
One must specify the vector base address, length and stride to access a vector in memory. Only a segment of the vector can be loaded into the vector register in a fixed number of cycles, because each vector register contains a fixed number of component registers. Long vectors are segmented and processed one segment at a time.

To permit parallel or pipelined access, vector operands should be kept in memory. To allow fast vector access, the memory system for a vector processor must be specifically designed. The access rate should match the pipeline rate. In fact, the access path is often itself pipelined and is known as an access pipe.
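The role of base address, stride and length can be illustrated with a small sketch. It assumes a row-major n x n matrix stored in a flat list; the helper name load_vector is hypothetical.

```python
def load_vector(memory, base, stride, length):
    """Read `length` elements starting at `base`, `stride` locations apart."""
    return [memory[base + k * stride] for k in range(length)]


if __name__ == "__main__":
    n = 4
    memory = list(range(n * n))  # element (i, j) is stored at i * n + j

    row_1 = load_vector(memory, base=1 * n, stride=1, length=n)      # unit stride
    column_2 = load_vector(memory, base=2, stride=n, length=n)       # stride n
    diagonal = load_vector(memory, base=0, stride=n + 1, length=n)   # stride n + 1
    print(row_1, column_2, diagonal)
```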


Q.40. Describe different vector-access memory organization schemes.
Or
Explain the following memory organizations for vector accesses -
(i) C-access memory organization
(ii) S-access memory organization
(iii) C/S-access memory organization.
Or
Write short note on vector access memory schemes. (R.G.P.V., June 2011)
Or
Explain vector access memory schemes. (R.G.P.V., June 2015, Dec. 2015)

Ans. There are three vector-access memory organization schemes, each built from interleaved memory modules that allow overlapped memory accesses:

(i) C-access Memory Organization - Fig. 4.28 and 4.29 illustrate that the m-way low-order interleaved memory structure permits m words to be accessed concurrently in an overlapped manner. This concurrent access is known as the C-access memory scheme, as shown in fig. 4.28. The access cycles are staggered in the various memory modules. The low-order a bits choose the modules and the high-order b bits choose the word in each module, where $m = 2^a$ and a + b = n is the length of the address.

Fig. 4.28 Low-order m-way Interleaving (The C-access Memory Scheme)

The successive addresses are latched in the address buffer at the rate of one per cycle to access a vector with a stride of 1. Effectively it takes m minor cycles to fetch m words, which is equal to one major (memory) cycle.

The successive accesses must be separated by two minor cycles to prevent access conflicts when the stride is 2. This decreases the memory throughput by one-half. There is no module conflict when the stride is 3, and the maximum throughput (m words) results.

Generally, C-access will give the maximum throughput of m words per memory cycle when the stride is relatively prime to m (the number of interleaved memory modules).

Fig. 4.29 Multiway Interleaved Memory Organization and the C-access Timing Chart: (a) Eight-way Low-order Interleaving (Absolute Address Shown in Each Memory Word); (b) Pipelined Access of Eight Consecutive Words in a C-access Memory

(ii) S-access Memory Organization - Fig. 4.30 (a) shows that the low-order interleaved memory can be rearranged to permit simultaneous access, or S-access. In this case, all memory modules are accessed at a time in a synchronized manner. Again the high-order (n - a) bits choose the same offset word from each module.

Fig. 4.30 (b) shows that at the end of each memory cycle, $m = 2^a$ consecutive words are latched in the data buffers simultaneously. Then, the low-order a bits are utilized to multiplex the m words out, one per each minor cycle. It takes two major (memory) cycles to access m consecutive words when the minor cycle is selected to be 1/m of the major cycle.

However, if the access phase of the last access is overlapped with the fetch phase of the current access, effectively m words take only one memory cycle to access, as shown in fig. 4.30 (b). The throughput reduces, roughly proportionally to the stride, when the stride is higher than 1.

Fig. 4.30 The S-access Interleaved Memory for Vector Operands Access: (a) S-access Organization for an m-way Interleaved Memory; (b) Successive Vector Accesses using Overlapped Fetch and Access Cycles


(iii) C/S-access Memory Organization - A C/S-access memory organization combines the C-access memory organization and the S-access memory organization. Fig. 4.31 depicts this scheme, where n access buses are utilized with m interleaved memory modules connected to each bus. On each bus, the m modules are m-way interleaved to permit C-access. The n buses operate in parallel to allow S-access. If the n buses are fully utilized with pipelined memory accesses, then a maximum of m.n words are fetched per memory cycle.

Fig. 4.31 The C/S Memory Organization

The C/S-access memory organization performs better in vector multiprocessor configurations. It offers parallel pipelined access of a vector data set with high bandwidth. To ensure smooth data movement between the memory and multiple vector processors, a special vector cache design is needed within each processor.
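The effect of the stride on C-access throughput, described under (i) above, can be checked with a small model. The sketch below is a simplification that only counts how many distinct modules are touched by m successive strided accesses under low-order interleaving (module number = address mod m); it does not model cycle-level timing, and the function names are hypothetical.

```python
from math import gcd


def distinct_modules(m: int, stride: int) -> int:
    """Number of different modules hit by m successive accesses
    with the given stride under m-way low-order interleaving."""
    return len({(k * stride) % m for k in range(m)})


def words_per_memory_cycle(m: int, stride: int) -> int:
    """Closed form of the same count: m / gcd(m, stride)."""
    return m // gcd(m, stride)


if __name__ == "__main__":
    m = 8
    for stride in (1, 2, 3, 4):
        print(stride, distinct_modules(m, stride), words_per_memory_cycle(m, stride))
```

With m = 8, a stride of 2 reaches only four modules (half the throughput), while strides of 1 and 3, both relatively prime to 8, reach all eight.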


Q.41. Write short note on parallel matrix multiplication.
(R.G.P.V., June 2010)

Ans. Matrix multiplication is one of the most computationally intensive operations performed on vector processors. The multiplication of two n x n matrices consists of $n^2$ inner products or $n^3$ multiply-add operations.

An n x m matrix of numbers has n rows and m columns and may be considered as constituting a set of n row vectors or a set of m column vectors. Consider, for example, the multiplication of two 3 x 3 matrices A and B:

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \times \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{bmatrix}$$

The product matrix C is a 3 x 3 matrix whose elements are related to the elements of A and B by the inner product

$$c_{ij} = \sum_{k=1}^{3} a_{ik} \times b_{kj}$$

For example, the number in the first row and first column of matrix C is calculated by letting i = 1, j = 1, to obtain

$$c_{11} = a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31}$$

This requires three multiplications and (after initializing $c_{11}$ to 0) three additions. The total number of multiplications or additions required to compute the matrix product is 9 x 3 = 27. If we consider the linked multiply-add operation c + a x b as a cumulative operation, the product of two n x n matrices requires $n^3$ multiply-add operations. The computation consists of $n^2$ inner products, with each inner product requiring n multiply-add operations, assuming that c is initialized to zero before computing each element in the product matrix.

In general, the inner product consists of the sum of k product terms of the form

$$C = A_1B_1 + A_2B_2 + A_3B_3 + A_4B_4 + \cdots + A_kB_k$$

In a typical application k may be equal to 100 or even 1000. The inner product calculation on a pipeline vector processor is shown in fig. 4.32. The values of A and B are either in memory or in processor registers. The floating-point multiplier pipeline and the floating-point adder pipeline are assumed to have four segments each. All segment registers in the multiplier and adder are initialized to 0. Therefore, the output of the adder is 0 for the first eight cycles until both pipes are full. $A_i$ and $B_i$ pairs are brought in and multiplied at a rate of one pair per cycle. After the first four cycles, the products begin to be added to the output of the adder. During the next four cycles 0 is added to the products entering the adder pipeline. At the end of the eighth cycle, the first four products $A_1B_1$ through $A_4B_4$ are in the four adder segments, and the next four products, $A_5B_5$ through $A_8B_8$, are in the multiplier segments. At the beginning of the ninth cycle, the output of the adder is $A_1B_1$ and the output of the multiplier is $A_5B_5$. Thus, the ninth cycle starts the addition $A_1B_1 + A_5B_5$ in the adder pipeline. The tenth cycle starts the addition $A_2B_2 + A_6B_6$, and so on. This pattern breaks down the summation into four sections as follows:

$$\begin{aligned} C = {} & A_1B_1 + A_5B_5 + A_9B_9 + A_{13}B_{13} + \cdots \\ & + A_2B_2 + A_6B_6 + A_{10}B_{10} + A_{14}B_{14} + \cdots \\ & + A_3B_3 + A_7B_7 + A_{11}B_{11} + A_{15}B_{15} + \cdots \\ & + A_4B_4 + A_8B_8 + A_{12}B_{12} + A_{16}B_{16} + \cdots \end{aligned}$$

Fig. 4.32 Pipeline for Calculating an Inner Product

When there are no more product terms to be added, the system inserts four zeros into the multiplier pipeline. The adder pipeline will then have one partial sum in each of its four segments, corresponding to the four sums listed in the four rows of the above equation. The four partial sums are then added to form the final sum.
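The four-way split of the summation can be mimicked in software. The sketch below is not a cycle-accurate pipeline model; it only reproduces the grouping of terms into four interleaved partial sums that are combined at the end. The function name and the `segments` parameter are illustrative assumptions.

```python
def pipelined_inner_product(a, b, segments=4):
    """Accumulate products into `segments` interleaved partial sums,
    then add the partial sums to form the final inner product."""
    partial = [0] * segments
    for i, (x, y) in enumerate(zip(a, b)):
        # Term i + 1 goes to section (i mod segments):
        # A1B1, A5B5, A9B9, ... share one partial sum, and so on.
        partial[i % segments] += x * y
    return sum(partial), partial


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [1, 1, 1, 1, 1, 1, 1, 1]
    total, partial = pipelined_inner_product(a, b)
    print(total, partial)
```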
Q.42. Write short note on vector supercomputers.

Ans. Supercomputers are very powerful, high-performance machines that cost millions of dollars and are used mostly for scientific computations. Supercomputers are characterized by high computational speed, fast and large main and secondary memory, and the extensive use of parallel structured hardware and software. Supercomputers also use special techniques for removing the heat from circuits to prevent them from burning up because of their close proximity. Today's supercomputers are designed to perform large-scale vector or matrix computations in the areas of structural engineering, aerodynamics, petroleum exploration, meteorology, VLSI circuit design, hydrodynamics, artificial intelligence, nuclear research, and tomography. The requirement of high speed and large internal memory is obvious in these scientific applications. Supercomputers provide a massive amount of raw computing power that processes large amounts of data. The data elements are organized in array, vector or matrix forms. Today's supercomputers should be capable of working at a speed of 100 megaflops or higher.

The first generation of vector supercomputers, in the 1960s, is marked by the development of the TI-ASC, Star-100 and the Illiac-IV. There were seven installations of the ASC, four installations of the Star-100 and only one Illiac-IV system installed at user sites by 1978. To accomplish parallel vector processing, both the ASC and the Star-100 systems are equipped with multiple functional pipeline processors. In pipeline mode, the ASC can manage up to 3-dimensional vector computations. The Star-100 has a memory-to-memory architecture with two pipeline processors. The maximum speed of both systems is near 40 megaflops. With the development of the Cray-1, the Cyber-200 series, the Fujitsu VP-200, the NEC SX series and the Hitachi-820 series, vector processors entered the second generation. The Cray-1, evolved from the CDC 6600/7600 series, is considered one of the fastest supercomputers that has ever been made. When all the resources are fully utilized, the maximum CPU rate of the Cray-1 is 160 megaflops. The Cyber-200 series is extended from the Star-100. The Cyber-205 contains both scalar and vector pipelines, with the potential to perform 800 megaflops. There were over 60 Cray-1 and Cyber-205 machines installed all over the world as of September 1982. Recently, Fujitsu in Japan announced a vector processor (VP-200) that can perform up to 500 megaflops.

A vector processor that can perform 1000 megaflops or more is highly desirable in future. Cray Research is currently extending the Cray-1 to a multiprocessor configuration known as the Cray X-MP. The Cray X-MP, consisting of dual processors, is expected to be five times more powerful than the Cray-1, with an expected peak speed of 400 megaflops. Eventually, Cray Research plans to further upgrade the Cray X-MP to a four-processor model, known as the Cray-2, that will be 12 times more powerful than the Cray-1 in vector processing mode. CDC has proposed to upgrade the Cyber-205 eventually to a vector processor which can give 3000 megaflops for numerical aerodynamic simulations.
Q.43. Explain the vectorization compiler related to vector processing.
(R.G.P.V., Dec. 2...)

Ans. A vectorizing compiler analyzes whether instructions in DO loops may be executed in parallel, and produces object code with vector instructions. The higher the vectorization ratio, the higher the performance. To obtain this, the compiler vectorizes complicated data accesses and restructures program sequences, subject to machine hardware constraints. Barriers to vectorization exist in conditional and branch statements, sequential dependencies, nonlinear and indirect indexing, and subroutine calls within loops.

The parser and scanner are not required to be modified in a vectorizing Fortran compiler. With the scalar code already in position, the task is to translate series of scalar operations into vector code.

The structure of the Fortran programs being compiled is tested by the vectorizing compiler. Usually, the higher the vectorizing ratio, the better the performance. This is because vector speed is much better than scalar speed on a vector processor. Consider the speed ratio of vector to scalar operations to be 50, as shown in fig. 4.33. The shaded regions correspond to decreased execution time for vector instructions. As represented, with this speed ratio the time of 50 scalar operations is reduced to that of one vector operation. Fig. 4.33 represents two levels of vectorization. Simple DO loops with arithmetic and simple scalar operations such as inner product, random-access and integer operations can be vectorized simply at the first level, for a vectorizing ratio of 50 percent. The remaining complex operations such as conditional statements, scatter, gather and others can only be vectorized by a very intelligent compiler. This intelligent compiler can efficiently access complex data structures and tune branch-disturbed program structures.

Fig. 4.33 Representation of Vectorization Ratio and Saving in Execution Time
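The saving in execution time for a given vectorization ratio can be estimated with a simple model. This is a sketch under the assumption that the vectorized fraction of the work runs 50 times faster (the speed ratio used in the text) and the rest runs at scalar speed; the function name is hypothetical.

```python
def relative_execution_time(vectorization_ratio: float,
                            speed_ratio: float = 50.0) -> float:
    """Execution time relative to a fully scalar run (= 1.0)."""
    scalar_part = 1.0 - vectorization_ratio
    vector_part = vectorization_ratio / speed_ratio
    return scalar_part + vector_part


if __name__ == "__main__":
    for ratio in (0.0, 0.5, 0.9, 1.0):
        print(ratio, relative_execution_time(ratio))
```

Under this model a vectorizing ratio of 50 percent only reduces the run time to about half, which is why the remaining complex operations matter for performance.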
Q.44. Describe the distributed memory model and shared memory model based on the memory distribution and addressing schemes of SIMD computer.
(R.G.P.V., Dec. 2004)
Or
Explain the different SIMD computer models based on the memory distribution and addressing scheme used. (R.G.P.V., Dec. 2007)

Ans. SIMD computers appear in two basic architectural organizations, depending on the memory distribution and addressing schemes.


(i) Distributed Memory Model - A SIMD computer in which each processing element has its own local memory is known as the distributed memory model. Fig. 4.34 depicts a distributed memory model for a SIMD computer.

This configuration is structured with N synchronized processing elements (PEs), all of which are under the control of a single control unit (CU). Each $PE_i$ is essentially an arithmetic logic unit (ALU) with connected working registers and a local memory $PEM_i$ for the storage of distributed data. The user and system programs are loaded from an external source into the control unit (CU) memory. The function of the CU is to decode all the instructions and determine where the decoded instructions should be executed. Scalar or control-type instructions are directly executed inside the CU. Vector instructions are broadcast to the PEs for distributed execution, to obtain spatial parallelism through duplicate arithmetic units (PEs).

All the PEs perform the same function synchronously in a lock-step fashion under the command of the CU. Before parallel execution in the array of PEs, vector operands are distributed to the PEMs. The distributed data can be stored in the PEMs from an external source through the system data bus, or through the CU in a broadcast mode using the control bus.

Fig. 4.34 Distributed Memory Model


(ii) Shared Memory Model - A SIMD computer in which all processing elements share a common memory is known as the shared memory model, as shown in fig. 4.35. This configuration differs from the distributed memory model in two aspects. First, the local memories connected to the PEs are substituted by parallel memory modules shared by all the PEs using an alignment network, and second, the inter-PE permutation network is substituted by the inter-PE memory alignment network, which is again handled by the CU. There are P memory modules and N PEs in fig. 4.35. The two numbers are not necessarily equal. In fact, they have been chosen to be relatively prime. The alignment network is a path-switching network between the PEs and the parallel memories. Such an alignment network is required to permit conflict-free accesses of the shared memories by as many PEs as possible.

Fig. 4.35 Shared Memory Model
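Both models can be illustrated with a short sketch. It is a toy model under stated assumptions: each PE's local memory is a Python dictionary, the CU broadcast is a function applied to every PE in turn, and shared-memory addresses are assumed to map to module number (address mod P). None of these names or mappings come from a specific machine.

```python
def simd_broadcast(pems, operation):
    """Distributed model: the CU broadcasts one vector instruction and
    every PE applies it to the operands in its own local memory (PEM)."""
    return [operation(local_memory) for local_memory in pems]


def modules_hit(addresses, p):
    """Shared model: module selected by each PE's address, assuming
    a simple address-mod-P interleaving across P memory modules."""
    return [address % p for address in addresses]


if __name__ == "__main__":
    # N = 4 PEs, each holding one element of vectors a and b.
    pems = [{"a": i, "b": 10 * i} for i in range(4)]
    print(simd_broadcast(pems, lambda mem: mem["a"] + mem["b"]))

    # Four PEs reading addresses with stride 4.
    addresses = [0, 4, 8, 12]
    print(modules_hit(addresses, 4))  # P = 4: every PE hits the same module
    print(modules_hit(addresses, 5))  # P = 5: all modules are different
```

The second part shows why choosing P relatively prime to N helps: with five modules, the stride-4 accesses of the four PEs fall on different modules.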
Q.45. Explain about distributed memory model. (R.G.P.V., June 2014)

Ans. Refer to Q.44 (i).

Q.46. What is shared memory model ?

Ans. Refer to Q.44 (ii).

Q.47. Compare distributed memory model and shared memory model.
(R.G.P.V., June 2010, 2011)
(R.G.P.V., June 2015)
Or
Differentiate between distributed memory model and shared memory model.
(R.G.P.V., Dec. 2017)

Ans. There are several differences between the distributed memory model and the shared memory model, as follows:
S.No. | Criterion | Shared Memory Model | Distributed Memory Model
(i) | Capability to parallelize small portions of an application at a time | Relatively simple to perform. Reward versus effort varies widely. | Relatively difficult to perform. Tends to need more of an all-or-nothing effort.
(ii) | Additional complexity over serial code to be addressed by programmer | Easy parallel algorithms are simple and fast to implement. Implementation of highly scalable complex algorithms is supported but more involved. | Significant extra overhead and complexity even for implementing easy and localized parallel constructs.
(iii) | Compiler, library and debugger requirements | Needs a special compiler and a runtime library which supports OpenMP. Well-written code will compile and run correctly on a single processor without an OpenMP compiler. Debugging tools are an extension of existing serial code debuggers. A single memory address space simplifies development and support of rich debugger functionality. | Does not need a special compiler. For the target computer, only a library is needed, and these are usually available. Debuggers are not easy to implement, since a direct, global view of all program memory is not available.
(iv) | Impact on code quantity and code quality | Generally needs a small increase in code size (2 to 25%), depending on the changes needed for parallel scalability. Code readability needs some knowledge of shared memory constructs, but the code is otherwise managed like directives embedded within serial code. | Tends to need extra copying of data into temporary message buffers, resulting in a significant amount of message handling code. Programmer is generally faced with extra code complexity even in non-performance-critical code segments. Code readability suffers accordingly.
(v) | Feasibility of scaling an application to a large number of processors | Currently, few vendors give scalable shared memory systems. | Most vendors give the capability to cluster non-shared memory systems with moderate to high-performance interconnects.

Q.48. Discuss in detail about the performance issues in symmetric and shared memory architectures. (R.G.P.V., Dec. 2017)

Ans. There are several performance issues in symmetric and distributed memory architectures, as follows:

(i) Data Access and Communication -
    (a) The memory hierarchy (caches and main memory) plays a significant role in determining communication cost:
        (1) May easily dominate the inherent communication of the algorithm.
    (b) For a uniprocessor, the execution time of a program is given by useful work time + data access time:
        (1) Useful work time is normally called the busy time.
        (2) Data access time can be reduced either by architectural techniques (e.g., large caches) or by cache-aware algorithm design that exploits spatial and temporal locality.

(ii) Data Access - In multiprocessors:
    (a) Every processor wants to see the memory interface as its own local cache and the main memory.
    (b) In reality, it is much more complicated.
    (c) If the system has a centralized memory (e.g., SMPs), there are still caches of other processors; if the memory is distributed, then some part of it is local and some is remote.
    (d) For shared memory, data movement from local or remote memory to cache is transparent, while for message passing it is explicit.
    (e) View a multiprocessor as an extended memory hierarchy, where the extension includes caches of other processors, remote memory modules and the network topology.


(iii) Artifactual Communication -
    (a) Communication caused by artifacts of the extended memory hierarchy:
        (1) Data accesses not satisfied in the cache or local memory cause communication.
        (2) Inherent communication is caused by data transfers determined by the program.
        (3) Artifactual communication is caused by poor allocation of data across distributed memories, unnecessary data in a transfer, unnecessary transfers due to system-dependent transfer granularity, redundant communication of data, and finite replication capacity (in cache or memory).
    (b) Inherent communication assumes infinite capacity and perfect knowledge of what should be transferred.

(iv) Capacity Problem - Most probable reason for artifactual communication:
    (a) Due to finite capacity of cache, local memory or remote memory.
    (b) May view a multiprocessor as a three-level memory hierarchy for this purpose: local cache, local memory, remote memory.
    (c) Communication due to cold or compulsory misses and inherent communication are independent of capacity.
    (d) Capacity misses generate communication resulting from finite capacity.
    (e) Generated traffic may be local or remote depending on the allocation of pages.
    (f) General technique - Exploit spatial and temporal locality to use the cache properly.

(v) Temporal Locality - Maximize reuse of data:
    (a) Schedule tasks that access the same data in close succession.
    (b) Many linear algebra kernels use blocking of matrices to improve temporal (and spatial) locality.
Example - Transpose phase in Fast Fourier Transform (FFT); to improve locality, the algorithm carries out a blocked transpose, i.e., transposes a block of data at a time.

Fig. 4.36 Block Transpose
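The blocked transpose mentioned in the example can be sketched as below. This is a plain illustration of the loop structure only: in Python the cache benefit is not observable, and the block size is an arbitrary assumption.

```python
def blocked_transpose(a, block):
    """Transpose a square matrix one block x block tile at a time."""
    n = len(a)
    t = [[0] * n for _ in range(n)]
    for bi in range(0, n, block):          # tile row
        for bj in range(0, n, block):      # tile column
            # All accesses below stay inside one tile of `a` and one tile of `t`.
            for i in range(bi, min(bi + block, n)):
                for j in range(bj, min(bj + block, n)):
                    t[j][i] = a[i][j]
    return t


if __name__ == "__main__":
    a = [[4 * i + j for j in range(4)] for i in range(4)]
    for row in blocked_transpose(a, block=2):
        print(row)
```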
(vi) Spatial Locality - Consider a square block decomposition of the grid solver and a C-like row-major layout, i.e., A[i][j] and A[i][j + 1] have contiguous memory locations. The same page is local to one processor while remote to others; the same applies to straddling cache lines. Ideally, all pages within a partition should be local to a single processor. The standard trick is to convert the 2D array to 4D.

Fig. 4.37 (labels: memory allocation, page, cache line; page straddles partition boundary; cache line straddles partition boundary)

(vii) 2D to 4D Conversion - Essentially you need to change the way memory is allocated:
    (a) The matrix A needs to be allocated in such a way that the elements falling within a partition are contiguous.
    (b) The first two dimensions of the new 4D matrix are the block row and column indices, i.e., for the partition assigned to processor P6 these are 1 and 2 respectively (assuming 16 processors).
    (c) The next two dimensions hold the data elements within that partition.
    (d) Thus the 4D array may be declared as float B[√P][√P][N/√P][N/√P].
    (e) The element B[3][2][5][10] corresponds to the element in the 10th column, 5th row of the partition of P14.
    (f) Now all elements within a partition have contiguous addresses.
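The index mapping behind the 2D to 4D conversion can be written out explicitly. The sketch assumes P is a perfect square, N is divisible by √P, and processors are numbered row by row over the block grid, which matches the P6 and P14 examples above; the function names are hypothetical.

```python
import math


def to_4d_index(i, j, n, p):
    """Map A[i][j] of an n x n grid to B[block_row][block_col][row][col]."""
    q = math.isqrt(p)      # blocks per dimension (sqrt of processor count)
    size = n // q          # rows (and columns) per partition
    return (i // size, j // size, i % size, j % size)


def owner(block_row, block_col, p):
    """Processor number owning a block, numbering blocks row by row."""
    return block_row * math.isqrt(p) + block_col


if __name__ == "__main__":
    n, p = 64, 16
    index = to_4d_index(53, 42, n, p)
    print(index, owner(index[0], index[1], p))
```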
(viii) Transfer Granularity -
    (a) Determines how much data is transferred in one communication:
        (1) For message passing, it is explicit in the program.
        (2) For shared memory, this is really under the control of the cache coherence protocol: there is a fixed size for which transactions are defined (normally the block size of the outermost level of the cache hierarchy).
    (b) In shared memory you have to be careful:
        (1) Since the minimum transfer size is a cache line, you may end up transferring extra data, e.g., in the grid solver the elements of the left and right neighbours for a square block decomposition (you require only one element, but must transfer the whole cache line); there is no good solution.

(ix) Worse: False Sharing - If the algorithm is designed so poorly that:
    (a) Two processors write to two different words within a cache line at the same time.
    (b) The cache line keeps on moving between the two processors.
    (c) The processors are not really accessing or updating the same element, but whatever they are updating happens to fall within one cache line; this is false sharing.
    (d) For shared memory programs, false sharing can easily degrade performance by a lot.
    (e) Easy to avoid: just pad up to the end of the cache line before starting the allocation of the data for the next processor (wastes memory, but improves performance).
(x) Contention -
    (a) It is very easy to ignore contention effects when designing algorithms:
        (1) Can severely degrade performance by creating hot-spots.
    (b) Location hot-spot - Consider accumulating a global variable; the accumulation takes place on a single node, i.e., all nodes access the variable allocated on that particular node whenever they try to increment it.

Fig. 4.38 (panels: flat accumulation, where the single node becomes a bottleneck; scalable tree accumulation)

(xi) Hot-spots -
    (a) Avoid a location hot-spot by either staggering accesses to the same location or by designing the algorithm to exploit a tree-structured communication.
    (b) Module hot-spot:
        (1) Normally happens when a particular node saturates handling too many messages (need not be to the same memory location) within a short amount of time.
        (2) The normal solution again is to design the algorithm in such a way that these messages are staggered over time.
    (c) Rule of thumb - Design the communication pattern such that it is not bursty; distribute it uniformly over time.

(xii) Overlap -
    (a) Increase overlap between communication and computation:
        (1) Not much to do at algorithm level unless the programming model and/or OS provide some primitives to carry out prefetching, block data transfer, non-blocking receive, etc.
        (2) Normally, these techniques increase bandwidth demand because you end up communicating the same amount of data, but in a shorter amount of time (execution time hopefully goes down if you can exploit overlap).


Q.49. Compare the advantages and shortcomings in implementing
private virtual memories and a globally shared virtual memory in a
multicomputer system. (R.G.P.V., Dec. 2016)


Ans. Private Virtual Memory - In this model, a private virtual memory
space is associated with each processor and is divided into pages. Pages
from different virtual spaces are mapped into the same physical memory,
which is shared by all processors.

Advantages - No locking is required to ensure consistency of data
because it uses protection on a per-process or per-page basis, private
memory mapping and the small address space of the processors.
Disadvantages - (i) The same physical page may be pointed to
by different virtual addresses. This is called the synonym problem.
(ii) It is also possible that the same virtual address in different
virtual spaces points to different pages in main memory.


Globally Shared Virtual Memory - In this model, a single virtual
space combines all virtual address spaces. A portion of the
shared virtual memory is provided to each processor to declare its addresses.

Advantages - (i) No synonyms are allowed.

(ii) Every address is unique in shared memory.

Disadvantages - (i) The address translation process is longer.
(ii) Locking is required.


Q.50. What is the importance of memory consistency model ?
(R.G.P.V., June 2016)

Ans. A consistency model refers to the degree of consistency that has to
be maintained for the shared memory data for the memory to work correctly
for a specific set of applications. It is defined as a set of rules that applications
must obey if they want the distributed shared memory (DSM) system to provide
the degree of consistency guaranteed by the consistency model. A newer consistency
model typically allows consistency requirements to be relaxed to a greater degree
than existing models, with the relaxation done in such a way that a set of
applications can still function properly. This helps in enhancing the performance
of these applications because better concurrency may be achieved by relaxing
the consistency requirement.

A consistency model in a multicore system ensures the correctness of
multithreaded programs. The memory consistency model also determines which
programmer/compiler and hardware optimizations are legal.


Q.51. Explain the models of memory consistency. (R.G.P.V., June 2016)
Ans. The various types of consistency models are as follows -


(i) Sequential Consistency Model - This model was proposed by
Lamport in 1979. A shared memory system is said to support the sequential
consistency model when all processes see the same order of all memory
access operations on the shared memory. The exact order in which memory
access operations are interleaved does not matter. That is, if the three operations
read (r1), write (w1), read (r2) are performed on a memory address in that
order, any of the orderings (r1, w1, r2), (r1, r2, w1), (w1, r1, r2), (w1, r2, r1),
(r2, r1, w1), (r2, w1, r1) of the three operations is acceptable provided all
processes see the same ordering.

A DSM system supporting the sequential consistency model may be 
implemented by ensuring that no memory operation is started until all the 
previous ones have been completed. 
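
The six acceptable orderings and the "all processes see the same ordering" condition can be checked with a short sketch. It follows the simplified description above; the names `orderings` and `all_processes_agree` and the process labels are illustrative assumptions.

```python
from itertools import permutations

# The three operations from the example above.
operations = ("r1", "w1", "r2")

# Every interleaving of the three operations: 3! = 6 orderings.
orderings = list(permutations(operations))


def all_processes_agree(views):
    """views maps a process name to the ordering (a tuple) that it observed.

    Returns True when every process observed the same ordering and that
    ordering is one of the possible interleavings.
    """
    observed = set(views.values())
    return len(observed) == 1 and observed.issubset(orderings)
```

For example, `all_processes_agree({"p1": ("w1", "r1", "r2"), "p2": ("w1", "r1", "r2")})` holds, while two processes that observe different orderings violate the condition.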

(ii) Weak Consistency Model — This model was designed to take 
benefit of the following two characteristics — 

(a) Isolated accesses to shared variables are rare. That is, in
many applications, a process makes a number of accesses to a set of shared
variables and then makes no access at all to the variables in this set for a long time.


(b) It is not necessary to show the change in memory done by
every write operation to other processes. The results of various write operations
can be combined and sent to other processes only when they require it.

Both characteristics imply that good performance may be achieved if
consistency is enforced on a group of memory reference operations instead
of on individual memory reference operations. This is the basic idea behind the
weak consistency model. The key problem in implementing this idea is to find
how the system can know that it is time to show the changes performed by a
process to other processes, since this time is different for different applications.

(iii) Processor Consistency Model - This model was proposed
in 1989. The processor consistency model is the same as the PRAM
consistency model with an additional restriction of memory coherence. That
is, a processor consistent memory is both coherent and adheres to the PRAM
consistency model. For any memory location, memory coherence means that
all processes agree on the same order of all write operations to that location.
Thus processor consistency guarantees that all write operations performed
on the same memory location are seen by all processes in the same order. This
requirement is in addition to the requirement imposed by the PRAM consistency
model. Thus, if w12 and w22 are write operations for writing to the same
memory location x, all processes must see them in the same order: w12
before w22 or w22 before w12, for the example given in PRAM consistency.
It means that both processes p1 and p2 must see the write operations in the
same order, which may be either [(w11, w12), (w21, w22)] or [(w21, w22),
(w11, w12)] for processor consistency.
(iv) Release Consistency Model - In the release consistency model, there
is a mechanism that clearly tells the system whether a process is entering a
critical section or exiting from a critical section. As a result, the system can
decide and perform only one of two operations when a synchronization variable
is accessed by a process: the first operation propagates the changes made by
the process to other nodes, and the second brings in the changes made by other
processes. This is done by using two synchronization variables rather than a
single synchronization variable. Release is used by a process to tell the system
that it has just exited a critical section, so that the system performs only the
first operation when this variable is accessed. Acquire is used by a process to
tell the system that it is about to enter a critical section, so that the system
performs only the second operation when this variable is accessed. Programmers
are responsible for placing acquire and release at suitable places in their programs.
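
The programmer's side of this model, placing acquire before and release after a critical section, can be illustrated with an ordinary lock. This is only an analogy under stated assumptions: Python's `threading.Lock` is not a DSM synchronization variable, and the names `worker` and `run` are hypothetical.

```python
import threading

lock = threading.Lock()
shared_counter = 0


def worker(increments):
    """Increment the shared counter inside a critical section."""
    global shared_counter
    for _ in range(increments):
        lock.acquire()      # acquire: about to enter the critical section
        try:
            shared_counter += 1
        finally:
            lock.release()  # release: just exited the critical section


def run(num_threads=4, increments=1000):
    """Start the workers, wait for them, and return the final counter."""
    global shared_counter
    shared_counter = 0
    threads = [threading.Thread(target=worker, args=(increments,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_counter
```

Because every update sits between an acquire and a release, `run(4, 1000)` returns 4000.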


Q.52. Discuss the design space for granularity and connectivity of SIMD
systems. (R.G.P.V., Dec. 2014)
Ans. Parallel algorithms are defined as such because multiple sections of
the algorithm are designed to run simultaneously on more than one processor.
These sections compose the parallel part of the algorithm. Within the parallel
sections, every active ... for very fine-grained algorithms, ...
than are available, thereby reducing ... Being able
to recognize the parallelism within an algorithm and analyze its granularity will
guide a programmer to the best parallel paradigm for the task at hand.

PRINCIPLES OF MULTITHREADING, MULTITHREADING
ISSUES AND SOLUTIONS, MULTIPLE-CONTEXT PROCESSORS


Q.53. Describe multithreaded architecture and its computation model.
Or
Discuss the principles of multithreading. (R.G.P.V., June 2013)


Ans. As shown in fig. 4.39 (a), a multithreaded massively parallel
processing (MPP) system is modeled by a network of processors and memories.
A global address space is formed by the distributed memories. To analyze
the performance of this network, the following machine parameters are
used -

(i) The Latency (L) - On a remote memory access, this is the
communication latency. The value of latency comprises the cache-miss penalty,
the network delays and delays due to contentions in split transactions.

(ii) The Number of Threads (N) - This is the number of threads that
can be interleaved in each processor. A context made up of a register set, a program
counter and the needed context status words represents a thread.

(iii) The Context-switching Overhead (C) - It specifies the cycles
lost in performing context switching in a processor. This time relies on the
switch mechanism and the amount of processor states devoted to maintaining
active threads.

(iv) The Interval between Switches (R) - It specifies the cycles
between switches triggered by remote references. For remote accesses, the
inverse p = 1/R is known as the rate of requests. It depends on a combination
of factors, including program behaviour.

One way to improve efficiency is to decrease the rate of requests. An
alternative is to remove processor waiting using multithreading.


[Figure: (a) network of processors and memories, annotated with the rate of requests (p = 1/R); (b) initial sequential thread, scheduling overhead, threads of parallel computation, inter-computer communication (distributed memories) and thread synchronization]
(a) The Architecture Environment
(b) Multithreaded Computation Model
Fig. 4.39 Multithreaded Architecture and its Computation Model


Multithreaded Computations - In 1992, the structure of the multithreaded
parallel computation model was given by Bell, as depicted in fig. 4.39 (b). The
computation begins with a sequential thread (a), followed by supervisory scheduling
(b) in which the processors start threads of computation (c), by intercomputer
messages which update variables among the nodes when the computer has a
distributed memory (d), ... Because a specific amount of data must be
exchanged to complete a computational grain, ...
Q.54. What is multithreading ? (R.G.P.V., June 2015)


Ans. Refer to Q.53.


Q.55. Describe problems of asynchrony.


Ans. MPPs work asynchronously in a network environment. There are
two basic latency problems triggered by the asynchrony: remote loads and
synchronizing loads.
Fig. 4.40 (a) shows the remote load situation, in which variables m and n
are located on nodes N2 and N3, respectively. They need to be brought to node
N1 for the computation, which requires the execution of two remote loads
(rloads). Let pm and pn be the pointers to m and n, respectively. The two rloads
may be issued from the same thread or from two distinct threads. The context of
the computation on N1 is denoted by the variable CTXT. It can be a process
identifier, a stack pointer, a frame pointer, a current-object pointer, etc.
Generally, the variable names such as VM, VN and R are interpreted relative to CTXT.

The idling because of synchronizing loads is shown in fig. 4.40 (b). In
this situation, concurrent processes compute m and n. Also, we cannot say
exactly when they will be ready for node N1 to read. The ready signals,
ready 1 and ready 2, may arrive at node N1 asynchronously. This is a typical
condition in the producer-consumer problem. Busy-waiting may result.


(a) The Remote Loads Problem (b) The Synchronizing Loads Problem
Fig. 4.40 Problems Caused by Asynchrony

In remote loads, the major issue is how to prevent idling in node N1
during the load operations. The latency due to the remote loads is an architectural
property. The latency due to the synchronizing loads also depends on scheduling
and the time it takes to compute m and n, which may be much greater as
compared to the transit latency. The remote-load latencies are often predictable,
whereas the synchronizing-load latencies are unpredictable.


Q.56. Write principles of multithreading. Also write multithreading
issues.


Ans. Refer to Q.53 and Q.55.


Q.57. Explain multithreading issues and solutions briefly.
(R.G.P.V., June 2010)
Ans. Multithreading Issues - Refer to Q.55.

Two solutions for overcoming the asynchrony problems are as follows -


(i) Multithreading Solutions - This solution multiplexes among a
number of threads. When a remote-load request is
issued by one thread, the processor starts to operate on some other thread and
so on, as shown in fig. 4.41 (a). Obviously, the thread switching cost must be much
less than the latency of the remote load. Otherwise, the processor
might as well wait for the remote load's response.

A larger number of threads is required to hide the internode latency effectively
when that latency increases. Also make sure that messages carry continuations.
Suppose that, after issuing a remote load from thread T1, we switch to thread T2,
which also issues a remote load, as shown in fig. 4.41 (a). The responses may
return in a different order. This happens due to the requests traveling distinct
distances, via different degrees of congestion, to destination nodes having different
loads, etc.

One way to handle this is to associate each remote load and its response
with an identifier for the appropriate thread, with the result that the thread can be
enabled again on the arrival of the response. These thread identifiers are known
as continuations on messages. To name a sufficient number of threads waiting
for remote responses, a large continuation space is offered.

(ii) Distributed Cacheing - Fig. 4.41 (b) illustrates the basic idea of
distributed cacheing. In distributed cacheing, every memory location has
an owner node. For instance, node N2 owns A and node N1 owns B.
Directories are used to maintain import-export lists and to specify the type of
data, i.e. shared or exclusive.

To cover the cache loading effects, the directories multiplex among a
small number of contexts. Directory based coherence protocols are
implemented by the Stanford Dash, KSR-1 and MIT Alewife. Note that
distributed cacheing addresses the remote-loads problem but not the
synchronizing-loads problem. However, the two methods can be merged to resolve both
types of remote-access difficulties.

[Figure: (a) node N1 switching between contexts ctxt1 and ctxt2; (b) nodes N1 and N2, each with a processor and a directory holding entries such as A : Import; Shared, A : Import; Exclusive, B : Export N2; Exclusive and B : Export N1, N16; Shared]
(a) Multithreading Solution
(b) Distributed Cacheing
Fig. 4.41 Solutions for Overcoming the Asynchrony Problems
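
The role of continuations can be sketched as a table that maps an identifier carried by each request to the thread that is waiting for the response. The function name and the data layout are assumptions made for illustration only.

```python
def match_responses(requests, arrival_order):
    """Re-enable threads as their responses arrive, possibly out of order.

    requests:      list of (thread_id, address) remote loads, in issue order.
    arrival_order: continuation numbers in the order the responses return;
                   the continuation of a request is its index in `requests`.
    Returns the thread ids in the order in which they are re-enabled.
    """
    pending = {}
    for continuation, (thread_id, _address) in enumerate(requests):
        pending[continuation] = thread_id  # tag carried by the message

    resumed = []
    for continuation in arrival_order:
        resumed.append(pending.pop(continuation))
    return resumed
```

If T1 and then T2 issue remote loads and the response for T2 returns first, `match_responses([("T1", "m"), ("T2", "n")], [1, 0])` re-enables T2 before T1.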


Q.58. Describe multiple-context (or multithreaded) processor model.
Also, discuss an example of multithreaded processor.


Or
What is multiple-context processors ? (R.G.P.V., Dec. 2010)
Or


Write short note on multiple-context processors. (R.G.P.V., Dec. 2017)

Ans. The efficiency of a processor can be expressed as
busy / (busy + switching + idle), where busy, idle and switching denote the
amount of time, measured over some large interval, that the processor is in the
corresponding state. In a multithreaded machine, the execution of various contexts
is interleaved to minimize the value of idle, but without an undue increase in the
magnitude of switching.

The processor state is determined by the disposition of the several contexts
on the processor. During its lifetime, a context cycles through the following four
states, namely, ready, running, leaving and blocked. A processor is switching
when making the transition from one context to another. A processor is busy
when there is a context in the running state. Otherwise all contexts are blocked
and we say the processor is idle. There can be at most one context running or
leaving. The processor is kept busy by a running context until it issues an operation
that needs a context switch. In the leaving state, the context spends C cycles,
thereafter goes into the blocked state for L cycles, and at last reenters the ready
state. Eventually, the processor will select it and the cycle will begin again. The
model depicted in fig. 4.42 (a) considers one thread per context. Here, each
context is expressed by its own program counter (PC), process status
word (PSW) and register set.

Fig. 4.42 (b) illustrates a multithreaded processor with three
thread slots. The processor is provided with several instruction queue unit and
decode unit pairs, known as thread slots. An
instruction fetch unit and all functional units are physically shared among
logical processors. Each thread slot, together with its program counter,
makes up a logical processor.

[Figure: (a) N contexts with one thread per context, each holding a PC, a PSW and a register set; (b) instruction cache, instruction queue units, schedule units, functional units (floating-point, load/store, integer multiplier), queue registers, and register sets allocated for the executing, waiting and ready threads]
(a) Multithreaded Model
(b) A Three-thread Processor Example
Fig. 4.42 Multiple-context Processor Model and an Example Design


There is a buffer in an instruction queue unit. This buffer stores some
instructions succeeding the instruction indicated by the program counter. The
size of the buffer needs to be at least B = N x C words, in which N
represents the number of thread slots and C represents the number of cycles
needed to access the instruction cache.

At most B instructions are fetched by an instruction fetch unit for one
thread every C cycles from the instruction cache. An instruction fetch unit
also tries to fill the buffers in the instruction queue unit. This fetching operation
is done in an interleaved manner for multiple threads. Thus, in one instruction
queue unit, the buffer is filled once in B cycles.


A thread can preempt the prefetching operation upon encountering a branch
instruction. There might be a bottleneck due to the instruction cache and fetch
unit for a processor with many thread slots. In these conditions, there is a
requirement of another cache and fetch unit. Simulation results show that by
executing two and four threads in parallel on a nine-functional-unit processor,
a 2.02 and a 3.72 fold speedup, respectively, can be obtained over a traditional
single-thread processor.

Q.59. Describe different context-switching policies.


Ans. The various multithreaded architectures are differentiated by the
context-switching policies used. Following are the four switching policies -

(i) Switch on Block of Instruction - There is interleaving of blocks
of instructions from various threads. This enhances the cache-hit ratio because
of locality. It is also useful for single-context performance.

(ii) Switch on Cache Miss - This policy applies to the case where a
context is preempted when it results in a cache miss. In this situation, R
denotes the average interval between misses (in cycles), and L denotes the
time needed to satisfy the miss. Here, the processor switches contexts only if
the current one will be delayed for a number of cycles.

(iii) Switch on Every Load - This policy permits switching on every
load, irrespective of whether it results in a miss or not. In this situation, R
denotes the average interval between loads. It is considered that a context is
blocked for L cycles after every switch. However, this occurs only if the load
results in a cache miss in the case of a switch-on-load processor.

(iv) Switch on Every Instruction - This policy permits switching
on every instruction, irrespective of whether it is a load or not. That is, instructions
are interleaved from various threads on a cycle-by-cycle basis. So, there will be
independent successive instructions, which will benefit pipelined execution.
However, cache misses may increase because of the breaking of locality. It
has been observed that the interleaving of contexts in cycle-by-cycle fashion
offers a performance benefit over switching at a cache miss, since the
context interleaving could hide pipeline dependences and decrease the context
switch cost.
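
The switch-on-every-instruction policy can be pictured as a round-robin merge of the instruction streams. The sketch below is a simplified model with hypothetical names; it ignores stalls, cache misses and switching cost.

```python
def interleave(threads):
    """Issue one instruction per cycle, rotating over the threads.

    threads: dict mapping a thread name to its list of instructions.
    Returns the issue order as a list of (thread name, instruction) pairs.
    """
    queues = {name: list(instrs) for name, instrs in threads.items()}
    schedule = []
    while any(queues.values()):
        for name, queue in queues.items():
            if queue:  # skip threads that have finished
                schedule.append((name, queue.pop(0)))
    return schedule
```

Successive entries in the schedule come from different threads whenever more than one thread still has work, which is why successive instructions tend to be independent.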

Q.60. Discuss the processor efficiency of multithreaded (multiple-
context) processors.

Ans. A context is executed by a single-thread processor until a remote
reference is issued (R cycles); thereafter the processor is idle until the reference
completes (L cycles). Clearly, there is no context switch and hence no switching
overhead. This behaviour can be modeled as an alternating renewal process
with a cycle of R + L. Here, L and R correspond to the amount of time
during a cycle that the processor is idle and busy, respectively. Hence, the
efficiency of a single-threaded machine is -

$$E_1 = \frac{R}{R + L}$$

This represents the performance reduction of such a processor in a parallel
system with a large memory latency.

By switching to a new context, memory latency can be hidden with
multiple contexts. However, we consider that the switch takes C cycles of
overhead. It is considered that the run length between switches is constant.
With a sufficient number of contexts, when a switch takes place there is
always a context ready to execute, hence the processor is never idle. Fig. 4.43
shows the efficiency of the processor analyzed under two different situations.
Snapshots of context switching in the saturation region are shown in fig. 4.43
(a) and for the linear region in fig. 4.43 (b). Fig. 4.43 (c) depicts
the processor efficiency represented as a function of the number of contexts.

(a) Snapshots of Context Switching in the Saturation Region
(b) Snapshots of Context Switching in the Linear Region
(c) Efficiency Curve (processor efficiency against number of contexts)
Fig. 4.43 Context Switching and Processor Efficiency

(i) Saturation Region - The processor works with maximum
utilization in the saturated region. In this situation, the cycle of the renewal process
is R + C and the efficiency is -

$$E_{sat} = \frac{R}{R + C}$$

In saturation, it is seen that the efficiency is independent of the latency. In
addition, it does not change when the number of contexts increases further.

If the time the processor spends servicing the other threads is greater
than the time needed to process a request, then saturation is achieved. It
means that (N - 1)(R + C) is greater than L. It provides the saturation point,
under constant run length, as

$$N_d = \frac{L}{R + C} + 1 = \frac{L + R + C}{R + C}$$

(ii) Linear Region - After a context switch, there may be no ready
contexts if the number of contexts is less than the saturation point. Therefore, the
processor will experience idle cycles. The time needed to switch to a
ready context, run it until a remote reference is issued and process the
reference is equal to R + C + L. When N is below the saturation
point, all the other contexts have a turn in the processor during this time. Hence,
the efficiency is represented as -

$$E_{lin} = \frac{N R}{R + C + L}$$

It is seen that the efficiency increases linearly as the number of contexts
increases until the saturation point is reached, and thereafter it remains constant.
Unless the context switch is very cheap, the remote reference rate must be low.
The saturation equation underlines the significance of the C/R ratio and provides the
fundamental limit on the efficiency of a multithreaded processor.

The processor efficiency can be represented as a function of the memory
latency L with an average run length R = 16 cycles as shown in fig. 4.44. The
C = 0 curve corresponds to zero switching overhead. Nearly 50% efficiency
can be obtained with C = 16 cycles. These results depend on a Markov model
of multithreaded architecture which was proposed by Saavedra. Note that
multithreading increases not only the processor efficiency, but also the network
traffic.

(a) Two Contexts per Processor (b) Six Contexts per Processor
Fig. 4.44 Processor Efficiency of a Multithreaded Architecture
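
The two regions can be tried out numerically with a short sketch of the model. The function names are hypothetical, and the latency of 64 cycles used in the example is an assumed value, not one taken from the text.

```python
def single_thread_efficiency(R, L):
    """Busy for R cycles, then idle for L cycles: R / (R + L)."""
    return R / (R + L)


def saturation_point(R, L, C):
    """Smallest N for which (N - 1)(R + C) covers the latency L."""
    return L / (R + C) + 1


def efficiency(N, R, L, C):
    """Efficiency with N contexts under constant run length R."""
    if N >= saturation_point(R, L, C):
        return R / (R + C)          # saturation region
    return N * R / (R + C + L)      # linear region
```

With R = 16 and C = 16, and assuming L = 64, the saturation point is 3 contexts; six contexts give an efficiency of 0.5, in line with the "nearly 50%" figure quoted above, while two contexts are still in the linear region.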


Q.61. Discuss deterministic scheduling models for multiprocessor
system. (R.G.P.V., June 2016)


Ans. In deterministic scheduling models for a multiprocessor system, all
the information needed to express the characteristics of the problem is known
before a solution to the problem, that is, a schedule, is attempted. Such
characteristics are the execution time of each task and the relationship between
the tasks in the system. The objective of the resultant schedules is to optimize
one or more of the evaluation criteria. For example, in deterministic models,
the execution time of each process can either be interpreted as the expected
processing time or as the maximum processing time. In the former case, the
length of the schedule represents a rough estimate of the mean length of the
computation and in the latter case, the time to complete the schedule would
be considered the maximum time to complete the system of processes. The
motivation for this objective is that, in many cases, a poor schedule can lead
to an unacceptable response time or utilization of system resources.
Deterministic models are not very realistic and do not take into consideration
the irregular and unpredictable demands made on the multiprocessor system.
Deterministic schedules are usually displayed with timing diagrams called
Gantt charts. The flow time of a process is equal to the time at which its
execution is completed.
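
A small sketch of a deterministic schedule is given below. It assumes independent tasks with known execution times and uses a simple rule (each task goes to the processor that becomes free first); this rule and the function name are illustrative assumptions, not a method prescribed by the text.

```python
def list_schedule(execution_times, num_processors):
    """Build a schedule for independent tasks with known execution times.

    execution_times: dict mapping a task name to its execution time.
    Returns (gantt, flow_times, schedule_length), where gantt[p] lists the
    (task, start, finish) entries of processor p, as in a Gantt chart.
    """
    free_at = [0] * num_processors
    gantt = [[] for _ in range(num_processors)]
    flow_times = {}
    for task, duration in execution_times.items():
        p = free_at.index(min(free_at))   # processor that is free earliest
        start = free_at[p]
        finish = start + duration
        gantt[p].append((task, start, finish))
        free_at[p] = finish
        flow_times[task] = finish         # flow time = completion time
    return gantt, flow_times, max(free_at)
```

For tasks T1 to T4 with execution times 3, 2, 2 and 1 on two processors, the schedule length is 4 and the flow time of T3 is 4.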


Q.62. Describe the two levels of threads. (R.G.P.V., June 2016)


Ans. The two levels of threads are kernel-level threads and user-level threads.
Kernel-level threads are supported directly by the operating system. The kernel
performs creation, scheduling, and management of threads in kernel space.
Since the thread management is done by the operating system, kernel threads
are generally slower to create and manage than user threads. However, since
the kernel is managing the threads, if a thread performs a blocking system
call, the kernel can schedule another thread in the application for execution.
Also, in a multiprocessor environment, the kernel can schedule threads on
different processors.
In kernel-level threads, the kernel has knowledge of the threads
and manages them. Here, no run-time system is needed, as shown in
fig. 4.45.

[Figure: Thread 0 to Thread 4 in user space, managed directly by the kernel in kernel space]
Fig. 4.45 Kernel-level Threads


User-level threads are supported above the kernel and are implemented
by a thread library at the user level. The library provides support for thread creation,
scheduling, and management with no support from the kernel. Because the
kernel is unaware of user-level threads, all thread creation and scheduling are
done in user space without the need for kernel intervention. Thus, user-level
threads are fast to create and manage. However, they have drawbacks. For
instance, if the kernel is single-threaded, then any user-level thread performing
a blocking system call will cause the entire process to block, even if
other threads are available to run within the application.

Fig. 4.46 shows the structure of user-level threads.

[Figure: Thread 0 to Thread 4 and the run-time system in user space, above the kernel in kernel space]
Fig. 4.46 User-level Threads


The user-level threads run on top of a run-time system. The run-time
system is a collection of procedures that manages threads. Whenever any
thread executes a system call, it calls a run-time system procedure.

Q.63. Explain how thread level parallelism within a processor can be
exploited ? With suitable diagrams, explain simultaneous multithreading,
design challenges and potential performance enhancement.


(R.G.P.V., June 2017)


Ans. Thread level parallelism within a processor can be exploited in
single or multiple processes by providing hardware support in a processor.
There are the following two techniques which provide hardware support in a
processor for thread level parallelism -
(i) Coarse-grained Multithreading (CGMT) - When a
process is waiting for main memory access because of a cache miss, it
requires a few tens of processor clock cycles. At that time, a CGMT processor
switches to another thread and starts processing that thread. It utilizes the
clock cycle time and exploits the thread level parallelism.
(ii) Fine-grained Multithreading (FGMT) - It provides fine-grained
sharing of the processor, i.e. the processor is shared amongst multiple executing
threads on the basis of individual clock cycles. It provides simultaneous
execution of multiple threads by a processor. Each thread is allocated the
processor's clock cycle in turn.
Both mechanisms utilize the clock cycles and exploit the thread level
parallelism.
Simultaneous Multithreading (SMT) - It is a technique for
improving the overall efficiency of superscalar CPUs with hardware
multithreading. SMT permits multiple independent threads of execution to
better utilize the resources provided by modern processor architectures.
Existing thread synchronization mechanisms can be retained with little
impact on SMT processor performance, if appropriate measures are taken
to ensure threads waiting for a semaphore do not consume a share of
execution resources.

[Figure: issue-slot diagram showing vertical and horizontal waste; SMT combines the advantages of ILP and fine-grain multithreading, with horizontal waste still present but reduced and vertical waste largely removed]
Fig. 4.47 SMT


Design Challenges of Simultaneous Multithreading -

(i) If SMT is implemented with coarse-grain, because of the deep
pipeline of a superscalar processor which is scheduled dynamically, simultaneous
multithreading will not gain much performance.
(ii) In a preferred thread implementation of simultaneous
multithreading, when the preferred thread experiences a stall, it results in a
reduction of the throughput of the processor.


Potential Performance Enhancement - Use of simultaneous
multithreading with a fine-grained implementation removes the deep pipelining
effect when it contains a preferred thread. The preferred thread provides
multithreading and saves some performance advantages.
0.64. How is multithreading used to exploit threat level parallelism 


hin a processor ? Explain with example. (R.GP.V.,- Dec. 2017) 


wit. 
Ans. A thread is like a process in which it has state and a current program 


counter, but threads typically share the address space of a single process, 
ga thread to simply access data of other threads within the same 


ermittin i : 
p i in which many threads share a processor 


process. Multithreading is a process, i e; 
without desiring an intervening process switch. The ability to switch between 


threads quickly is what enables multithreading to be used to hide pipeline and 
memory latencies. For exposing more parallelism to the hardware, 
multithreading is a primary scheme. In a strict sense, multithreading employs 
thread-level parallelism in improving pipeline utilization. However, raising 
performance by using ILP has the major merit that it is reasonably transparent 
to the developer, because thread-level parallelism may be quite limited or tough 
to exploit is some applications. 

Multithreading permits multiple threads to share the functional units of
a single processor in an overlapping manner. In contrast, a more general
procedure to exploit thread-level parallelism (TLP) is with a multiprocessor,
which has multiple independent threads operating at once and in parallel.
Unlike a multiprocessor, multithreading does not duplicate the entire
processor. Instead, multithreading shares most of the processor core among a
set of threads, duplicating only private state, like the program counter and
registers. Many current processors incorporate multiple processor cores on one
chip and also give multithreading within each core.

Duplicating the per-thread state of a processor core means creating a separate
register file and a separate program counter for each thread. The memory
itself can be shared via the virtual memory mechanisms. Moreover, the hardware
must support the ability to change to a different thread relatively quickly.
In particular, as compared to a process switch, a thread switch should be much
more efficient. A process switch needs hundreds to thousands of processor
cycles. For multithreading hardware to get performance improvements, a program
must comprise multiple threads which could execute in a concurrent manner.
Either the compiler or the programmer recognizes these threads.

Example — An online transaction processing system contains natural
parallelism among the multiple queries and updates which are presented by
requests. Several scientific applications have natural parallelism, since they
model the 3-D, parallel structure of nature, and that structure may be
exploited by using separate threads. Even desktop applications that employ
modern windows-based operating systems have various active applications
running, giving a parallelism source.
Also refer to Q.63. 
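
The idea that threads keep private control state but share one address space
can be sketched in software. The following is only an illustration with
operating-system threads in Python, not a model of hardware multithreading.
The request strings and the names `requests`, `results` and `handle` are
assumptions made for the example.

```python
import threading

# Shared data: every thread in the process sees the same list objects.
requests = ["query-1", "update-2", "query-3", "update-4"]
results = [None] * len(requests)

def handle(index):
    # Each thread has its own stack and program counter, but it reads and
    # writes the shared lists directly; nothing is copied between threads.
    results[index] = requests[index].upper()

threads = [threading.Thread(target=handle, args=(i,))
           for i in range(len(requests))]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Each thread writes to its own slot of `results`, so no locking is needed in
this particular sketch.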




PARALLEL PROGRAMMING MODELS, SHARED-VARIABLE
MODEL, MESSAGE-PASSING MODEL, DATA-PARALLEL MODEL,
OBJECT-ORIENTED MODEL, FUNCTIONAL AND LOGIC MODELS


programming are shared-variable model, message-passing model, object- 
oriented model, data-parallel model, and functional and logic model. 


Q.2. Explain the following terms in shared-variable model —
(i) Shared-variable communication
(ii) Critical section
(iii) Protected access
(iv) Multiprogramming
(v) Multitasking
(vi) Multiprocessing
(vii) Multithreading
(viii) Program partitioning and replication
(ix) Scheduling and synchronization
(x) Cache coherence and protection.

Or
Discuss the various semantic issues in parallel programming. How can
they be resolved ? (R.GP.V., June 2010)

Or
Write short note on shared-variable model. (R.GPV, Dec. O 
Or 


Describe about shared-variable model. (R.GP.V., June 2012)

Ans. (i) Shared-variable Communication — Multiprocessor programming
depends on the use of shared variables in a common memory for IPC
(Interprocess Communication). Shared-variable IPC requires the use of shared
memory and mutual exclusion among multiple processes accessing the same
set of variables, as depicted in fig. 5.1 (a).


[Figure: (a) IPC using Shared Variables, with processes accessing shared
variables in a common memory; (b) IPC using Message Passing, with messages
sent and received over a communication channel]
Fig. 5.1 Interprocess Communication
In tightly coupled multiprocessors fine-grain MIMD parallelism is used. 


The implementation of interprocessor synchronization is done either 
conditionally or unconditionally, on the basis of the mechanisms used. 

The important issues involved in the use of this model are, protected 
access of critical sections, atomicity of memory operations, fast synchronization, 
shared data structures and fast data movement techniques. 

(ii) Critical Section (CS) — A code segment accessing shared variables
is referred to as a critical section. Only one process at a time can execute
it and, once begun, it must be finished without interruption. It means that a
critical section operation is indivisible. It meets the following four
requirements —

(a) Mutual Exclusion (Mutex) — At most one process can
execute the critical section at a time.
(b) Nonpreemption — After entering the CS, there is no interrupt
until completion.
(c) No Deadlock in Waiting — There is no circular wait by
two or more processes attempting to enter the critical section because at least
one process will enter.
(d) Eventual Entry — A process trying to enter its CS will
eventually enter.
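
A minimal sketch of a critical section, assuming Python threads stand in for
the cooperating processes. The names `counter`, `counter_lock` and `deposit`
are hypothetical. The lock provides mutual exclusion only; it does not by
itself guarantee nonpreemption or eventual entry.

```python
import threading

counter = 0                        # shared variable
counter_lock = threading.Lock()    # enforces mutual exclusion

def deposit(times):
    global counter
    for _ in range(times):
        with counter_lock:         # enter the critical section
            counter += 1           # read-modify-write on shared data
        # leaving the "with" block releases the lock (exit from the CS)

workers = [threading.Thread(target=deposit, args=(10_000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(counter)
```

Without the lock, the read-modify-write on `counter` could interleave between
threads and some updates could be lost.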


(iii) Protected Access — Preventing race conditions, where concurrent
processes running in different orders provide different results, is the key
problem related with the use of a critical section. The performance is
influenced by the granularity of a CS. When the boundary of a critical section
is too big, it may limit parallelism because of long waiting by competing
processes. If the boundary of the critical section is too small, it may cause
unnecessary code complexity or software overhead. The solution is to employ
conditional CSs or to reduce a heavy-duty CS for achieving a balanced
performance. The implementation of CSs and prevention of system deadlocks is
achieved by binary and counting semaphores. For structured programming,
monitors are appropriate.

Special atomic operations for IPC, new language constructs to express
parallelism, compilation support to achieve parallelism and use of OS for
scheduling parallel events and preventing resource conflicts are required by
shared-variable programming. All of these are dependent on the memory
consistency model used.

(iv) Multiprogramming — It refers to multiple independent programs
executing on a single processor or on a multiprocessor by time-sharing use of
the system resources. A multiprocessor is useful in solving a single large
problem or for executing multiple programs.

A multiprogrammed multiprocessor allows multiple programs to run concurrently
through time-sharing of all the processors in the system. CPU and I/O
activities are interleaved for multiple programs. When a program enters I/O
mode, the processor switches to another program. Multiprogramming is therefore
not limited to a multiprocessor; it is often implemented even on a single
processor. Most recent OSes support multiprogramming, e.g. Windows XP and
Linux.

(v) Multitasking or Time Sharing — Multitasking (or time sharing)
is a logical extension of multiprogramming. In multitasking a single large
program is divided into multiple interrelated tasks, which are run concurrently
on a multiprocessor. Cray multiprocessors are used for this purpose. Hence,
multitasking is used to execute two or more tasks of a single program
concurrently. If a job is multitasked properly then it takes less execution
time. Multitasking is obtained by adding code to the original program in order
to give proper linkage and synchronization of the interrelated partitioned
tasks.

There are tradeoffs between multitasking and not multitasking. Multitasking
pays off only when overhead is low. Sometimes, all parts of a program cannot
be partitioned into parallel tasks. Therefore, before the implementation of
multitasking, tradeoffs in multitasking must be analyzed.

(vi) Multiprocessing — The implementation of multiprogramming at
the process level on a multiprocessor is known as multiprocessing. There are
two types of multiprocessing. The multiprocessor works in MIMD mode when IPCs
are handled at the instruction level. The multiprocessor works in MPMD
(Multiple Programs over Multiple Data Streams) mode when IPCs are handled at
the procedure level. MIMD multiprocessing is defined with fine-grain
instruction-level parallelism. MPMD multiprocessing is defined with
coarse-grain procedure-level parallelism. Shared variables are used to obtain
IPC in both modes.


(vii) Multithreading — The conventional UNIX/OS contains a
single-threaded kernel where only one process can get OS kernel service at a
time. In a multiprocessor system, the single kernel must be extended to be
multithreaded. The aim is to enable multiple threads of lightweight processes
to share the same address space and to be run by the same or different
processors simultaneously. The concept of multithreading is an extension to
the concepts of multitasking and multiprocessing. The objective is to use
fine-grain parallelism in modern multiprocessors made with multiple-context
processors or superscalar processors with multiple instruction issues.


In preserving event order and in securing data coherence, the levels of 
sophistication increase from monoprogramming to multitasking, to 
multiprogramming, to multiprocessing, and to multithreading in that order. In 
parallel thread operations, memory management and special protection 
mechanisms must be developed to guarantee correctness and data integrity. 

(viii) Program Partitioning and Replication — Program partitioning
is a method for partitioning a large program and data set into a number of
small pieces for parallel execution by multiple processors. Both programmers
and the compiler are involved in program partitioning. Parallelism detected by
users is often explicitly expressed with parallel language constructs. Program
restructuring methods can be used to transform sequential programs into a
parallel form more suitable for multiprocessors. Ideally, this transformation
should be done automatically by a compiler.

The duplication of the same program code for parallel execution on
multiple processors over various data sets is called program replication.
Partitioning is appropriate for a shared-memory multiprocessor system. On the
other hand, replication is more suitable for message-passing multicomputers.

(ix) Scheduling and Synchronization — Scheduling of divided
program modules on parallel processors is much more difficult than the
scheduling of sequential programs on a uniprocessor. Static scheduling is
carried out at post-compile time. The benefit of static scheduling is low
overhead. However, the disadvantage is a possible mismatch with the run-time
profile of each task and hence possibly poor use of resources. Dynamic
scheduling catches the run-time situations. However, dynamic scheduling
needs preemption, fast context switching between programs, and OS support.
The benefit of dynamic scheduling is good resource utilization, at the expense
of higher scheduling overhead. Both static and dynamic techniques are
jointly used in a sophisticated multiprocessor system which needs higher
efficiency.

Interprocess communication is carried out at the process level in a
conventional UNIX system. Any processor can create processes. All processes
asynchronously accessing the shared data must be protected so that only one
process is permitted to access the shared writable data at a time. This mutual
exclusion property is applied using locks, semaphores, and monitors.

Virtual program counters can be allocated to various processes or threads
at the control level. Counting semaphores or barrier counters specify the
completion of parallel branch activities. Atomic memory operations like
Test&Set and Fetch&Add can also be used to get synchronization.
Software-implemented synchronization may need longer overhead. The
synchronization time can be reduced by hardware barriers or combining
networks.
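
Dynamic scheduling with an atomic Fetch&Add can be sketched as below. This is
a software stand-in only: the class `FetchAndAdd` emulates the atomic memory
operation with a lock, and `run_dynamic`, `tasks` and the squaring workload
are hypothetical names chosen for the example. A barrier marks the completion
of the parallel branch.

```python
import threading

class FetchAndAdd:
    """Software stand-in for an atomic Fetch&Add memory operation."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, increment=1):
        # Return the old value and add the increment as one indivisible step.
        with self._lock:
            old = self._value
            self._value += increment
            return old

def run_dynamic(tasks, num_workers):
    next_index = FetchAndAdd(0)
    done = [None] * len(tasks)
    barrier = threading.Barrier(num_workers)

    def worker():
        while True:
            i = next_index.fetch_and_add()   # claim the next unassigned task
            if i >= len(tasks):
                break
            done[i] = tasks[i] ** 2          # the work itself
        barrier.wait()                       # all workers meet here when idle

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

print(run_dynamic([1, 2, 3, 4, 5], 3))
```

Because each worker claims work at run time, a slow task does not hold up the
tasks behind it, which is the benefit the text attributes to dynamic
scheduling. The cost is the extra synchronization on every claim.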


(x) Cache Coherence and Protection — Apart from maintaining data
coherence in a memory hierarchy, multiprocessors must maintain data
consistency between private caches and the shared memory. The multicache
coherence problem requires an invalidation or update after each write
operation. These coherence control operations need special bus or network
protocols for implementation. A memory system is called coherent if the value
returned on a read instruction is always the value written by the latest write
instruction on the same memory location. The accessing order of the main
memory and the caches creates a great difference in computational results.

The shared memory in a multiprocessor can be utilized under different
consistency models. Sequential consistency needs strongly ordered memory
accesses on a global basis. A processor cannot issue an access until the most
recent shared writable memory access has been globally performed. A weak
consistency model applies ordering and coherence at synchronization points
only. Programming may be more restricted with release consistency or processor
consistency, however memory performance is expected to improve.
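
The invalidation after each write can be illustrated with a toy model. The
sketch below assumes a write-through, write-invalidate scheme on a single
snooping bus; the classes `Cache` and `Bus` and the address `0x10` are
hypothetical and do not describe any particular machine.

```python
class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}              # address -> cached value

class Bus:
    """Toy shared bus with main memory and write-invalidate coherence."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, addr):
        if addr not in cache.lines:  # miss: fetch the block from memory
            cache.lines[addr] = self.memory.get(addr, 0)
        return cache.lines[addr]

    def write(self, cache, addr, value):
        for other in self.caches:
            if other is not cache:
                other.lines.pop(addr, None)   # invalidate stale copies
        cache.lines[addr] = value
        self.memory[addr] = value             # write-through to memory

bus = Bus()
c0, c1 = Cache("P0"), Cache("P1")
bus.attach(c0)
bus.attach(c1)
bus.memory[0x10] = 1

first = (bus.read(c0, 0x10), bus.read(c1, 0x10))  # both caches hold the value 1
bus.write(c0, 0x10, 2)                            # P1's copy is invalidated
second = bus.read(c1, 0x10)                       # P1 misses and refetches
print(first, second)
```

The read by `c1` after the write returns the latest written value, which is
the coherence condition stated above.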


Q.3. Write and explain four operational modes used in programming
multiprocessor system by giving an example of each. (R.GP.V.)

Ans. Four operational modes used in programming multiprocessor systems
are as follows —


(i) Multiprogramming — Refer to Q.2 (iv). 

(ii) Multitasking or Time Sharing — Refer to Q.2 (v). 
(iii) Multiprocessing — Refer to Q.2 (vi). 

(iv) Multithreading — Refer to Q.2 (vii). 


Q.4. Explain the concept of interprocessor synchronization. 
(R.GP.V., Dec. 2009) 
Ans. Cooperating processes in a multiprocessor environment must often
communicate and synchronize. Execution of one process can influence the
other via communication. Interprocess communication employs one of two
schemes: use of shared variables or message passing. Often the processes that
communicate do so via a synchronization mechanism. A process executes
with unpredictable speed and generates actions or events which must be
recognized by another cooperating process. The set of constraints on the
ordering of these events constitutes the set of synchronization required for
the cooperating processes. The synchronization mechanism is used to delay
execution of a process in order to satisfy such constraints.
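
Delaying a process until an ordering constraint is satisfied can be sketched
with a condition variable. This is an illustration with Python threads, under
the assumption that one thread produces an event that another must wait for;
`buffer`, `consumed`, `producer` and `consumer` are hypothetical names.

```python
import threading

buffer = []                       # shared data object
consumed = []
cond = threading.Condition()      # lock plus wait/notify queue

def consumer():
    with cond:                    # mutual exclusion on the shared object
        while not buffer:         # state is inappropriate for "take"
            cond.wait()           # delay until another thread changes it
        consumed.append(buffer.pop(0))

def producer():
    with cond:
        buffer.append("event")    # change the state of the shared object
        cond.notify()             # wake a delayed consumer, if any

c = threading.Thread(target=consumer)
p = threading.Thread(target=producer)
c.start()
p.start()
c.join()
p.join()

print(consumed)
```

The consumer proceeds correctly whichever thread runs first, because it
rechecks the state of `buffer` each time it is woken.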
Two types of synchronization are commonly employed when using shared
variables. The first is mutual exclusion and the other is condition
synchronization. Mutual exclusion ensures that a physical or virtual resource
is held indivisibly. Another situation occurs in a set of cooperating processes
when a shared data object is in a state that is inappropriate for executing a
given operation. Any process which attempts such an operation should be
delayed until the state of the data object changes to the desired value as a
result of other processes being executed. This type of synchronization is
sometimes called condition synchronization. The mutual-exclusive execution of
a critical

Q.5. Explain synchronous and asynchronous message-passing model.
Or
Explain message passing model briefly. (R.GP.V., Dec. 2010, 2015)
Or
Draw message-passing model. (R.GP.V., June 2012)
Or
Write short note on message passing programming model.
(R.GP.V., June 2014)

Ans. (i) Synchronous Message Passing — In synchronous message
passing, the sender process and the receiver process must be synchronized in
time and space, just like a telephone call using circuit switched lines.
Generally, there is no need of buffers in the communication channels. For this
reason, synchronous communication is blocked by channels being busy or in
error, because only one message is permitted to be sent through a channel at a
time.

Apart from having a time connection, the sender and receiver must also
be linked by physical communication channels in space. There must be a path
of channels which is ready to enable the message passing between them. In
other words, the sender and receiver must be coupled in both time and space
synchronously. If one process wants to communicate and the other does not,
then the one that wants to is blocked (it waits). For this reason, synchronous
communication is known as a blocking communication scheme.


- 


(ii) Asynchronous Message Passing — In the asynchronous paradigm,
the passing of a message need not synchronize the sending and the receiving
process in time and space. Buffers are often used in channels. This results in
nonblocking message passing, provided adequately large buffers are used or
the network traffic is not saturated.

However, arbitrary communication delays may be found, as the sender
may not know when the message has been received until an acknowledgement is
received from the receiver. This approach is the same as a postal service
using mailboxes (i.e., channel buffers) having no synchronization between
senders and receivers.

Asynchronous message passing is nonblocking: the two processes do not need
to be synchronized either in time or in space. The sender is permitted to send
a message with no blocking, irrespective of whether the receiver is ready or
not.

Buffers are used to hold the messages along the path of the connecting
channels in asynchronous communication. The sender will eventually be blocked
because channel buffers are finite. Buffers are not required in a synchronous
multicomputer because only one message is permitted to pass through a channel
at a time. How to distribute or duplicate the program codes and data sets over
the processing nodes is the crucial issue in programming this model. There
exist tradeoffs between computation time and communication overhead that must
be considered.
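
The two styles can be contrasted in a small sketch. It assumes that a bounded
queue plays the role of the channel buffer and that a rendezvous can be
emulated by making the sender wait until the receiver has taken the message.
The function names `asynchronous_demo` and `synchronous_demo` are
hypothetical.

```python
import queue
import threading

def asynchronous_demo(buffer_slots):
    """Count how many nonblocking sends fit before the finite buffer is full."""
    channel = queue.Queue(maxsize=buffer_slots)   # finite channel buffer
    sent = 0
    try:
        while True:
            channel.put_nowait(f"msg-{sent}")     # send without waiting
            sent += 1
    except queue.Full:
        pass                                      # a blocking send would wait here
    return sent

def synchronous_demo(messages):
    """Sender and receiver meet on every message (blocking rendezvous)."""
    channel = queue.Queue(maxsize=1)
    received = []

    def receiver():
        for _ in messages:
            received.append(channel.get())
            channel.task_done()                   # signal that the message was taken

    t = threading.Thread(target=receiver)
    t.start()
    for m in messages:
        channel.put(m)
        channel.join()        # sender blocks until the receiver has the message
    t.join()
    return received

print(asynchronous_demo(3))
print(synchronous_demo(["a", "b"]))
```

With no receiver draining the queue, the asynchronous sender runs out of
buffer space after `buffer_slots` messages, which mirrors the remark that the
sender is eventually blocked because channel buffers are finite.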
Q.6. Explain shared variable and message passing model.
(R.GP.V., June 2013, 2015)
Or
Explain shared variable model and message passing model in detail.
(R.GP.V., Dec. 2017)

Ans. Refer to Q.2 and Q.5.

Q.7. Explain array processing. (R.GP.V., June 2016)

Ans. An array processor is a processor that performs computations on
large arrays of data. The term is used to refer to two different types of
processors. An attached array processor is an auxiliary processor attached to
a general-purpose computer. It is intended to improve the performance of the
host computer in specific numerical computation tasks. An SIMD array
processor is a processor that has a single-instruction, multiple-data
organization. It manipulates vector instructions by means of multiple
functional units responding to a common instruction. Although both types of
array processors manipulate vectors, they differ in their internal
organization.

The SIMD form of parallel processing is called array processing. Fig. 5.2
illustrates array processors. A two-dimensional grid of processing elements
executes an instruction stream broadcast from a central control processor. As
each instruction is broadcast, all elements execute it simultaneously. Each
processing element is connected to its four nearest neighbours for purposes
of exchanging data. End-around connections may be provided on both rows
and columns, but they are not shown in the fig. 5.2.

[Figure: a control processor broadcasts instructions to a grid of processing
elements]
Fig. 5.2

Array processors are highly specialized machines. They are well suited to
numerical problems that can be expressed in matrix or vector format. However,
they are not very useful in speeding up general computations.
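
One broadcast instruction on such a grid can be sketched as follows. The
sketch assumes that the end-around connections are present, and the operation
chosen (each element sums its four nearest neighbours) is only an example. The
Python loop visits the elements one by one, whereas the hardware would update
all of them in the same step; `broadcast_step` is a hypothetical name.

```python
def broadcast_step(grid):
    """Apply one instruction to every processing element of the grid.

    Each PE replaces its value by the sum of its four nearest neighbours,
    with end-around (wrap) connections on both rows and columns.
    """
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            new[r][c] = (grid[(r - 1) % rows][c] + grid[(r + 1) % rows][c]
                         + grid[r][(c - 1) % cols] + grid[r][(c + 1) % cols])
    return new

grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
result = broadcast_step(grid)
print(result)
```

All new values are computed from the old grid, so the outcome does not depend
on the order in which the loop visits the elements. This corresponds to the
lockstep execution of the array.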


Q.8. Explain the following terms in data-parallel model —
(i) Data parallelism
(ii) Array language extensions
(iii) Compiler support.
Or
Write short note on data parallel model. (R.GP.V., Dec. 2010)
Or
Explain about data-parallel model. (R.GP.V., Dec. 2014)

Ans. (i) Data Parallelism — Programming SIMD array processors has
been a challenge for computational scientists since the invention of the
Illiac-IV computer. In Illiac-IV, the key problem has been to match the
problem size with the fixed machine size. It means that large arrays or
matrices must be divided into 64-element segments before they can be processed
by the 64 processing elements.

The SIMD computers, like the Connection Machine CM-2, provided bit-slice
fine-grain data parallelism using 16,384 PEs simultaneously in a single-array
configuration. This needs a lower degree of array segmentation. Therefore
this provided greater flexibility in programming.
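
Matching the problem size to a fixed machine size can be sketched as below,
assuming a machine of 64 processing elements as in the Illiac-IV. The function
names `simd_add` and `segmented_add`, the zero padding, and the use of a mask
to switch off PEs that have no data in the last segment are assumptions made
for the example.

```python
NUM_PES = 64   # fixed machine size (the Illiac-IV had 64 PEs)

def simd_add(a, b, mask):
    """One vector instruction: enabled PEs compute a[i] + b[i];
    disabled PEs leave their operand a[i] unchanged."""
    return [x + y if enabled else x for x, y, enabled in zip(a, b, mask)]

def segmented_add(a, b):
    """Add two long vectors by cutting them into 64-element segments."""
    out = []
    for start in range(0, len(a), NUM_PES):
        seg_a = a[start:start + NUM_PES]
        seg_b = b[start:start + NUM_PES]
        active = len(seg_a)
        pad = NUM_PES - active
        mask = [True] * active + [False] * pad     # disable PEs with no data
        result = simd_add(seg_a + [0] * pad, seg_b + [0] * pad, mask)
        out.extend(result[:active])
    return out

a = list(range(150))
b = [1] * 150
print(segmented_add(a, b)[:5])
```

A vector of 150 elements needs three passes here: two full segments and one
segment in which only 22 of the 64 PEs are enabled.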


Synchronous SIMD programming is different from asynchronous MIMD
programming in that all PEs in an SIMD computer work in a lockstep manner,
whereas all processors in an MIMD computer run different instructions
asynchronously. Consequently, the mutual exclusion or synchronization problems
related with multiprocessors or multicomputers do not arise in SIMD computers.

Hardware directly controls inter-PE communications. Inter-PE data
communication is also conducted in lockstep fashion, apart from lockstep in
computing operations among all PEs. SIMD computers are rather well suited to
exploring spatial parallelism in large arrays, grids, or meshes of data due to
synchronized instruction executions and data-routing operations.

The control unit directly runs scalar instructions in an SIMD program.
Vector instructions are broadcast to the entire array of PEs. The PEs are
loaded with vector operands from local memories simultaneously with the help
of a global address having various offsets in local index registers. Vector
stores are executed similarly. Constant data is sent to all PEs
simultaneously.

All PEs can be disabled or enabled dynamically in any instruction cycle
by setting a masking pattern (i.e., binary vector) under program control.
Hardware directly assists masking instructions. An inter-PE routing network,
which is also controlled by program control on a dynamic basis, assists
data-routing vector operations.


(ii) Array Language Extensions — Array extensions in data-parallel
languages are represented by high-level data types. The array syntax offers
the elimination of some nested loops in the code. It should reflect the
architecture of the array processor. DAP Fortran for the AMT/Distributed Array
Processor, CFD for the Illiac-IV, C* for the TMC/Connection Machine, and MPF
for the MasPar family of parallel computers are examples of array processing
languages. There should be a global address space in an SIMD programming
language. This eliminates the requirement for explicit data routing between
PEs. The array extensions should have the ability to make the number of PEs a
function of the problem size instead of a function of the target machine.


(iii) Compiler Support — The array language expressions and their
optimizing compilers must be embedded in familiar standards like Fortran 77,
Fortran 90 and C to assist data-parallel programming. The main ideas are the
unification of the program execution model, enabling incremental migration to
data-parallel execution, and facilitation of precise control of massively
parallel hardware.

Compiler-optimized control of SIMD machine hardware permits the programmer
to drive the PE array transparently. The compiler must separate the program
into scalar and parallel components and integrate with the UNIX environment.

The compiler technology must permit the array extensions to optimize
data placement, minimize data movement and virtualize the dimensions of the
PE array. The compiler produces data-parallel machine code to perform
operations on arrays.

Array sectioning permits a programmer to reference a section or a region
of a multidimensional array. Array sections are designated by specifying a
start index, a stride, and a bound. Vector-valued subscripts are used to
construct arrays from arbitrary permutations of another array. Vector-valued
subscripts are vectors that map the required elements into the target array.
They make it easy to implement gather and scatter operations on a vector of
indices.
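
Array sectioning and vector-valued subscripts can be sketched with ordinary
lists. The helper names `section`, `gather` and `scatter` are hypothetical,
indices are zero-based, and the bound is taken to be inclusive, which is an
assumption modelled on Fortran 90 array notation.

```python
def section(array, start, bound, stride):
    """Elements at indices start, start + stride, ... up to and including bound."""
    return array[start:bound + 1:stride]

def gather(source, indices):
    """Vector-valued subscript on the right-hand side: result[k] = source[indices[k]]."""
    return [source[i] for i in indices]

def scatter(target, indices, values):
    """Vector-valued subscript on the left-hand side: result[indices[k]] = values[k]."""
    result = list(target)
    for i, v in zip(indices, values):
        result[i] = v
    return result

a = [10, 11, 12, 13, 14, 15, 16, 17]
print(section(a, 1, 7, 2))                  # [11, 13, 15, 17]
print(gather(a, [7, 0, 3]))                 # [17, 10, 13]
print(scatter([0] * 5, [4, 1], [99, 42]))   # [0, 42, 0, 0, 99]
```

On an SIMD machine each element of such a section would be handled by a
different PE; the list comprehension only stands in for that.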


SIMD programs can be recompiled for MIMD architecture. The idea is to
develop a source-to-source precompiler to convert, for instance, from
Connection Machine C* programs to C programs executing on an nCUBE
message-passing multicomputer in SPMD mode.

Clearly, SPMD programs are a special class of SIMD programs. They
support medium-grain parallelism and synchronization at the subprogram level
instead of at the instruction level. In this way both synchronous SIMD and
loosely coupled MIMD computers use the data-parallel programming model.


Q.9. What is data parallel architecture ? How it is important in
uniprocessor and multiprocessor architecture ? (R.GP.V., June 2017)

Ans. Parallel processing is an efficient form of processing data
or information simultaneously, or in parallel. It examines the processor-level
parallelism in computers and focuses on the use of multiple CPUs to achieve
very high throughput and fault tolerance.

The performance of computers increases steadily due to advances in
hardware technologies and processor designs. A further way to raise
performance is to exploit processor-level parallelism, for example by building
a computer that contains a large number of processors working in parallel on
common tasks. Suppose that a computer P(n) is constructed by combining n
copies of a single computer P(1). If a task T can be partitioned into n
subtasks of similar complexity and P(n) can be programmed in such a way that
its n processors execute the n subtasks in parallel, we would expect P(n) to
process T about n times faster than P(1) can process it. One advantage of
processor-level parallelism is tolerance of hardware and software faults. A
sequential computer fails completely in case of CPU failure, whereas a
parallel computer can be designed to continue functioning, perhaps at a
reduced performance level, in the presence of defective CPUs.
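
The expectation that P(n) processes T about n times faster can be sketched
with a simple model. The sketch assumes that the task is a list of equal work
items, that time is measured as the number of items handled by the busiest
processor, and that communication costs nothing. The names `partition`,
`run_on_p_n` and `ideal_speedup` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(task, n):
    """Split a task (a list of work items) into n subtasks of similar size."""
    size, extra = divmod(len(task), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(task[start:end])
        start = end
    return parts

def run_on_p_n(task, n):
    """Let n workers each sum one subtask, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        partial = list(pool.map(sum, partition(task, n)))
    return sum(partial)

def ideal_speedup(task_len, n):
    """Speedup of P(n) over P(1) when only the work split is counted."""
    longest = -(-task_len // n)      # ceiling division: busiest processor
    return task_len / longest

task = list(range(1000))
print(run_on_p_n(task, 4), ideal_speedup(1000, 4), ideal_speedup(10, 4))
```

When the task does not divide evenly, as with 10 items on 4 processors, the
model gives a speedup below n because the busiest processor still handles
three items.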


Fig. 5.3 One Dimensional Array of n Processors 


Parallel processing plays an important role in uniprocessor and
multiprocessor architectures. In a uniprocessor, parallel processing
mechanisms are identified in the following six categories —
(i) Multiplicity of functional units.
(ii) Parallelism and pipelining within the CPU.
(iii) Overlapped CPU and I/O operations.
(iv) Use of a hierarchical memory system.
(v) Balancing of subsystem bandwidths.
(vi) Multiprogramming and time sharing.

In case of multiprocessor systems, parallel processing improves
throughput, reliability, flexibility and availability. The multiprocessor
system contains two or more processors of approximately comparable
capabilities. Each processor has its own local memory and private devices.
The communications between the processors can be done through the shared
memories or through an interrupt network.

The different interconnections used in a multiprocessor between the
memories and processors are —
(i) Time-shared common bus
(ii) Crossbar switch network
(iii) Multiport memories.


Q.10. Discuss the following terms in object-oriented model — 


(i) Concurrent OOP 
(ii) Parallelism in COOP 


(iii) An actor model.
Or 


Write short note on object-oriented model. 
(R.GP.V., Dec. 2010, June 2013) 


Or 
Write short note on object-oriented parallel programming model. 
(R.GP.V., June 2011) 
Or 


Explain object-oriented model. (R.GP.V., June 2014) 


Or 


What is object-oriented model ? (R.GP.V., Dec. 2015) 

Ans. (i) Concurrent OOP — The popularity of object-oriented programming
(OOP) is attributed to three application demands. First, individual users
make heavy use of interacting processes, like the use of multiple windows.
Second, workstation networks are used as a cost-effective mechanism for
resource sharing and distributed problem solving. Third, multiprocessor
technology has advanced to the point of offering supercomputing power at a
fraction of the traditional cost.

In fact, program abstraction results in program modularity and software
reusability as is found in OOP. Other areas that supported OOP are the
development of CAD tools and word processors with graphics capabilities.

An object is a program entity which encapsulates data and methods in a
single unit. A natural consequence of the concept of objects is concurrency.
The concurrent manipulation of objects in OOP is just like the concurrent use
of coroutines in conventional programming.

An alternative model for concurrent computing on multiprocessors or
multicomputers is provided by the development of COOP. Object models differ in
the internal behaviour of objects and their interaction with each other.
(ii) Parallelism in COOP — There are three common patterns of
parallelism in COOP —

(a) Pipeline concurrency is the overlapped enumeration of
successive solutions and concurrent testing of the solutions as they are
derived from an evaluation pipeline.

(b) Divide-and-conquer concurrency is the concurrent
elaboration of various subprograms and the combining of their solutions to
give a solution to the overall problem. In this situation, no interaction
exists between the procedures solving the subproblems.

These two patterns are shown in fig. 5.4. Fig. 5.4 (a) shows a prime
number generation pipeline. In a linear pipeline of primes, integers are
produced and successively tested for divisibility by previously produced
primes. The circled numbers indicate produced integers.

[Figure: (a) Pipeline Concurrency, a number generator feeding a linear
pipeline of primes; (b) Divide-and-conquer Concurrency]
Fig. 5.4 Concurrency Types in Object-oriented Programming

A number is entered in the pipeline from the left end and is removed if it
is divisible by the prime number tested at a pipeline stage. The numbers
passed on beyond a stage are those which are indivisible by all the prime
numbers tested on the left of that stage.

The multiplication of a list of numbers (4, 3, 5, -7, 6, 2, 9, 8) using the
divide-and-conquer method is shown in fig. 5.4 (b). The leaves of the tree
express the numbers. The problem is recursively partitioned into subproblems
of multiplying two sublists. Each sublist is concurrently evaluated and the
results multiplied at the upper node.

(c) Cooperative problem solving is the third pattern. The
dynamic path evaluation of various physical bodies under the mutual influence
of gravitational fields is a simple example. In this situation, each object
must interact with another. Objects store intermediate results and share them
by passing messages.
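
The first two patterns can be sketched as follows. The pipeline is modelled
with chained generators, which run one item at a time rather than truly
concurrently, and the divide-and-conquer product evaluates only the two top
sublists in separate threads. Both simplifications, and all function names,
are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def integers(start=2):
    """Number generator feeding the left end of the pipeline."""
    n = start
    while True:
        yield n
        n += 1

def stage(prime, upstream):
    """One pipeline stage: remove numbers divisible by this stage's prime."""
    for n in upstream:
        if n % prime != 0:
            yield n

def primes():
    stream = integers()
    while True:
        p = next(stream)            # first number to survive every stage
        yield p
        stream = stage(p, stream)   # append a new stage for this prime

def product(numbers):
    """Recursively split the list and multiply the two partial results."""
    if len(numbers) == 1:
        return numbers[0]
    mid = len(numbers) // 2
    return product(numbers[:mid]) * product(numbers[mid:])

def concurrent_product(numbers):
    """Evaluate the two top-level sublists concurrently, then combine."""
    mid = len(numbers) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(product, numbers[:mid])
        right = pool.submit(product, numbers[mid:])
        return left.result() * right.result()

first_primes = list(islice(primes(), 8))
total = concurrent_product([4, 3, 5, -7, 6, 2, 9, 8])
print(first_primes, total)
```

The two halves of the list share no data, which matches the remark that no
interaction exists between the procedures solving the subproblems.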


(iii) An Actor Model — An actor model was developed at MIT. It is a
framework for COOP. Actors are self-contained, independent, interactive
components of a computing system. They communicate by asynchronous
message passing. Message passing is joined with semantics in an actor model.
The following are the basic actor primitives —

(a) Create — From a given behaviour description and a set of
parameters, an actor is created.
(b) Send-to — A message is sent from one actor to another.
(c) Become — The behaviour of an actor is replaced by a new
behaviour.

The behaviour replacement specifies state changes. One can aggregate
changes and prevent unnecessary control-flow dependences using the replacement
mechanism. The visualization of concurrent computations is done in terms of
simultaneous communication events, concurrent actor creations, and behaviour
replacements. An actor (object) may modify its state, send new messages and
create new objects in response to a message.

Concurrency control structures are used to express specific patterns of
message passing. A low-level description of concurrent systems is given by
the actor primitives. High-level constructs are also required for increasing
the granularity of descriptions and for encapsulating faults. The actor model
is specifically appropriate for multicomputer implementations.
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gull Explain functional and logical models of parallel computer, 
. l (R.GP.V., June 2010) 
Or 


about function and logic models. 


Describe A (R.GPV., Dec. 2014) 
r 


ional and logic models ? (R.GF.V., June 2015) 


hat is funct 
vi Or 


Write short note on functional and logic model. (R.GP.V., Dec. 2017) 
Ans. Functional Model — A functional programming language is concerned
with the functionality of a program. It should not generate side effects after
execution. No concept of storage, assignment and branching is available in
functional programs. The fewer the side effects, the greater the opportunity
for parallelism. Precedence restrictions arise only as a result of function
application. A function evaluation gives the same result irrespective of the
order of the evaluation of its parameters. It follows that all parameters in a
dynamically created structure of a functional program can be evaluated in
parallel. All single-assignment and dataflow languages are functional in
nature, so functional programming models are used in data-driven
multiprocessors. The functional model is referentially transparent. It gives
emphasis on fine-grain MIMD parallelism.

Most of the parallel computers designed to assist the functional model
were oriented toward Lisp. Functional programs have also been executed on
other dataflow computers.
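The order-independence of argument evaluation can be illustrated with a minimal Python sketch. The helper `apply_parallel` and the functions `square`, `cube` and `add` are hypothetical names chosen for illustration; they are not part of any functional language discussed above. Because the functions have no side effects, evaluating the arguments concurrently gives the same result as evaluating them one after another.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x          # pure: result depends only on x


def cube(x):
    return x * x * x      # pure: result depends only on x


def add(a, b):
    return a + b


def apply_parallel(fn, *thunks):
    """Evaluate each argument (given as a zero-argument callable) concurrently,
    then apply fn to the results in their original positions."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(thunk) for thunk in thunks]
        return fn(*(future.result() for future in futures))


# Sequential evaluation of the arguments.
sequential = add(square(3), cube(2))

# Concurrent evaluation of the same arguments.
parallel = apply_parallel(add, lambda: square(3), lambda: cube(2))
```

If `square` or `cube` modified shared storage, the two results could differ, which is why side effects reduce the opportunity for parallelism.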
Logic Model — Logic programming depends on predicate logic. It is
appropriate for knowledge processing dealing with large databases. The
logic programming model uses an implicit search strategy. It also assists
parallelism in the logic inference process. A question is answered if there
are matching facts in the database. If the predicates and associated
parameters of two facts are the same then these facts are said to be matched.
The process of matching and unification can be parallelized under
specific conditions. In logic programming, clauses are transformed into
dataflow graphs. Parallel unification has been tried on some dataflow
computers made in Japan.

Concurrent Prolog and Parlog are two parallel logic programming
languages. Both languages implement relational language features like
AND-parallel execution of conjunctive goals, IPC by shared variables and
OR-parallel reduction. In Parlog, the resolution tree contains one chain at
AND levels, and OR levels are partially or fully produced. The search strategy
in Concurrent Prolog follows multiple paths or depth first. These logic
programming systems may also exploit stream parallelism.
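The matching step can be sketched in Python as follows. This is a simplified one-way match (variables appear only in the question), not full unification, and the fact base, the `?name` variable convention and the function names are assumptions made for illustration. Each fact is tested independently of the others, which is what allows the matching to be parallelized.

```python
# Hypothetical fact base: (predicate, parameter, parameter).
facts = [
    ("parent", "asha", "ravi"),
    ("parent", "ravi", "meena"),
    ("parent", "asha", "kiran"),
]


def is_variable(term):
    # Assumed convention: a string starting with "?" is a variable.
    return isinstance(term, str) and term.startswith("?")


def match(query, fact):
    """Return the variable bindings if query matches fact, else None."""
    if len(query) != len(fact) or query[0] != fact[0]:
        return None                      # different predicate or arity
    bindings = {}
    for q, f in zip(query[1:], fact[1:]):
        if is_variable(q):
            # A variable must bind to the same value everywhere it occurs.
            if bindings.setdefault(q, f) != f:
                return None
        elif q != f:
            return None                  # constant parameters differ
    return bindings


def answer(query):
    # Every fact is an independent candidate, so these tests could run in parallel.
    results = (match(query, fact) for fact in facts)
    return [b for b in results if b is not None]


children_of_asha = answer(("parent", "asha", "?child"))
```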


Q.12. Explain various parallel programming models. (R.G.P.V., June 201…)
Or
Write short note on parallel programming models. (R.G.P.V., Dec. 20…)


Ans. Refer to Q.1, Q.2, Q.5, Q.8, Q.10 and Q.11. 


Q.13. Explain three parallel architecture models and compare their
merits and demerits. (R.G.P.V., June 2…)

Ans. Three parallel architecture models are —
(i) Shared Variable Model — Refer to Q.2.

Merits —
(a) In shared memory, the global address space gives a
user-friendly programming approach to memory.
(b) The data sharing among processes is fast and uniform due
to the closeness of memory to CPU.
(c) The communication of data among processes need not be
specified distinctly.
(d) The process communication overhead is negligible.
(e) It is very simple and easy to learn.

Demerits —
(a) It is not portable.
(b) It is very difficult to manage the data locality.

(ii) Message Passing Model — Refer to Q.5.

Merits —
(a) It provides low-level control of parallelism.
(b) It is portable.
(c) Less error prone.
(d) Less overhead in parallel synchronization and data
distribution.

Demerits —
(a) Message passing code generally needs more software
overhead than parallel shared memory code.
(b) In a parallel algorithm, more programming changes are
required.
(c) Sometimes it becomes difficult to debug.
(d) It does not perform well in the communication network
between the nodes.

(iii) Object-oriented Model — Refer to Q.10.

Merits —
(a) The object-oriented approach is used to reduce the
maintenance of the system.
(b) It is used in real world modeling.
(c) It improves the reliability and flexibility of the system.
(d) It helps in high code reusability.

Demerits —
(a) Experienced programmers are in short supply and many are
unfamiliar with the approach.
(d) Proper tools and support are limited.
(e) It is difficult to implement purely.


Q.14. What do you mean by tuple space model of parallel programming ?
Write a Linda program for any task graph you assume.
(R.G.P.V., June 2017)

Ans. A tuple space is an implementation of the associative memory paradigm
for parallel or distributed computing. It provides a repository of tuples that
can be accessed concurrently. For example, consider a group of processors
that produces data and a group of processors that uses that data. Producers
post their data as tuples in the space, and the consumers then retrieve the data
from the space that match a certain pattern. This is also known as the blackboard
metaphor. Tuple space may be thought of as a form of distributed shared memory.
It behaves like a blackboard or bulletin board where anyone (process) can add a
notice (tuple) or remove a notice. The tuple space thus becomes a global
mailbox, i.e. its mail (tuple) can be accessed by any process for any purpose
(like receiving, sending or broadcasting messages).
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Linda Program — 
begin
    global a, b, c, d, e, f;
    read a, b, c, d;
    eval ("Proc 1", cos (R(a)**3));
    eval ("Proc 2", cos (b));
    eval ("Proc 3", exp (h(c)), sin (exp (h(c))));
    eval ("Proc 4", P(d));
    rd ("Proc 1", ?q);
    rd ("Proc 2", ?r);
    rd ("Proc 3", ?s, ?t);
    rd ("Proc 4", ?w);
    e = w * s + t;
    f = q + r + s;
    write e, f
end
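The behaviour of the tuple space used by such a program can be imitated with a small Python sketch. This is only an illustration under stated assumptions, not a real Linda implementation: the class `TupleSpace`, the use of `None` as a formal (wildcard) field in place of `?q`, and the method name `in_` are all hypothetical. `eval` starts a new thread that deposits its result as a tuple, `rd` reads a matching tuple without removing it, and `in_` removes it.

```python
import threading


class TupleSpace:
    """A tiny in-memory tuple space; None in a pattern matches any value."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *tup):
        # Add a tuple and wake up any process waiting for a match.
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def eval(self, tag, fn, *args):
        # Compute fn(*args) in a new thread, then deposit (tag, result).
        worker = threading.Thread(target=lambda: self.out(tag, fn(*args)))
        worker.start()
        return worker

    def _find(self, pattern):
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, tup)
            ):
                return tup
        return None

    def rd(self, *pattern):
        # Block until a matching tuple exists; leave it in the space.
        with self._cond:
            while (tup := self._find(pattern)) is None:
                self._cond.wait()
            return tup

    def in_(self, *pattern):
        # Block until a matching tuple exists; remove it from the space.
        with self._cond:
            while (tup := self._find(pattern)) is None:
                self._cond.wait()
            self._tuples.remove(tup)
            return tup


space = TupleSpace()
space.eval("Proc 1", lambda a: a * a, 3)   # produces ("Proc 1", 9)
space.eval("Proc 2", lambda b: b + 1, 4)   # produces ("Proc 2", 5)
_, q = space.rd("Proc 1", None)
_, r = space.rd("Proc 2", None)
total = q + r
```

The consumer does not need to know which thread produced a tuple or when; it only states the pattern it wants, which is the associative-memory idea described above.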


Q.15. What are CRCW and CREW ? (R.G.P.V., June 2015)

Ans. The CRCW (Concurrent Read, Concurrent Write) PRAM (Parallel
Random Access Machine) allows for both concurrent reads and concurrent 
writes. When we use such a model, the details of the concurrent write must 
be specified. The CREW (Concurrent Read, Exclusive Write) PRAM is one 
of the most popular models because it is intuitively appealing to assume that 
concurrent reads may occur, but concurrent writes may not occur. 
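What "specifying the details of the concurrent write" can mean is sketched below in Python. The three policies shown (common, arbitrary and priority) are the conventions usually quoted for CRCW PRAMs; they are not listed in the answer above, and the function names and request format are assumptions made for this illustration.

```python
def crcw_write(requests, policy):
    """Resolve simultaneous writes to one memory cell.

    requests: list of (processor_id, value) pairs issued in the same step.
    Returns the value that ends up in the cell under the given policy.
    """
    if not requests:
        raise ValueError("no write requests")
    values = [value for _, value in requests]
    if policy == "common":
        # All processors must agree on the value being written.
        if len(set(values)) != 1:
            raise ValueError("common CRCW: conflicting values")
        return values[0]
    if policy == "priority":
        # The processor with the lowest id wins.
        return min(requests, key=lambda request: request[0])[1]
    if policy == "arbitrary":
        # Any one request may succeed; this sketch picks the first.
        return values[0]
    raise ValueError("unknown policy: " + policy)


def crew_write(requests):
    """CREW: at most one processor may write a cell in a step."""
    if len(requests) != 1:
        raise ValueError("CREW: concurrent writes are not allowed")
    return requests[0][1]


requests = [(2, 7), (0, 5), (1, 7)]
by_priority = crcw_write(requests, "priority")
by_arbitrary = crcw_write(requests, "arbitrary")
```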


Q.16. State and prove Amdahl's law. (R.G.P.V., June 2016)

Ans. Amdahl's law (1967) is based on a fixed workload or a fixed problem
size. A weighted harmonic mean speedup is represented by

$$S = \frac{T_1}{T^*} = \frac{1}{\sum_{i=1}^{n} f_i / R_i} \qquad \text{...(i)}$$

where $f_i$ is the weight (probability) of execution mode $i$ and $R_i$ is the
execution rate in that mode.

Let $R_i = i$ and let the weights be $w = (f_1, \ldots, f_n) = (\alpha, 0, 0, \ldots, 0, 1-\alpha)$.
This means $w_1 = \alpha$, $w_n = 1-\alpha$, and $w_i = 0$ for $i \neq 1$ and $i \neq n$.
It means the system is used either in a fully parallel mode using n processors
with a probability $1-\alpha$, or in a pure sequential mode on one processor
with a probability $\alpha$. By putting $R_1 = 1$, $R_n = n$ and w into
equation (i), we get

$$S_n = \frac{1}{\dfrac{\alpha}{1} + \dfrac{1-\alpha}{n}} = \frac{n}{n\alpha + 1 - \alpha} = \frac{n}{1 + (n-1)\alpha} \qquad \text{...(ii)}$$

It is called Amdahl's law. The implication is that $S_n \to 1/\alpha$ as
$n \to \infty$. The best speedup one can expect is upper bounded by
$1/\alpha$ under the above probability assumption.

We plot equation (ii) as a function of n for four values of $\alpha$
as shown in fig. 5.5. The speedup performance decreases as the value
of $\alpha$ increases from 0.01 to 0.9.
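The effect of the sequential fraction can be checked numerically with a short Python sketch of equation (ii). The function name `amdahl_speedup` is hypothetical, and the two intermediate values of the sequential fraction (0.1 and 0.5) are assumptions chosen for illustration; only the end points 0.01 and 0.9 come from the text.

```python
def amdahl_speedup(n, alpha):
    """Speedup on n processors when a fraction alpha of the work is sequential."""
    return n / (1 + (n - 1) * alpha)


# Speedup for a growing number of processors and four sequential fractions.
alphas = [0.01, 0.1, 0.5, 0.9]
processors = [1, 4, 16, 64, 256, 1024]
table = {
    alpha: [amdahl_speedup(n, alpha) for n in processors]
    for alpha in alphas
}

for alpha in alphas:
    row = ", ".join(f"{s:.2f}" for s in table[alpha])
    print(f"alpha = {alpha}: {row}   (upper bound 1/alpha = {1 / alpha:.2f})")
```

Each printed row stays below its bound 1/alpha however many processors are added, which is the limit stated above.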


Fig. 5.5 (axes: speedup versus n)

PARALLEL LANGUAGES AND COMPILERS, LANGUAGE
FEATURES FOR PARALLELISM, PARALLEL PROGRAMMING
ENVIRONMENT, SOFTWARE TOOLS AND ENVIRONMENTS


Q.17. Write short note on parallel programming languages. 


Ans. Parallel programming languages are languages designed to program 
algorithms and applications on parallel computers. Parallel processing is a 
great opportunity for developing high performance systems and solving large 
problems in many application areas. During the last few years parallel 
computers ranging from tens to thousands of computing elements became 
commercially available. They continue to gain recognition as powerful tools 
in scientific research, information management, and engineering applications. 
This trend is driven by parallel programming languages and tools that 
contribute to make parallel computers useful in supporting a broad range of 
applications. Many models and languages have been designed and implemented 
to allow the design and development of applications on parallel computers. 
Parallel programming languages (called also concurrent languages) allow 
the design of parallel algorithms as a set of concurrent actions mapped onto 
different computing elements. The cooperation between two or more actions 
can be performed in many ways according to the selected language. The 
design of programming languages and software tools for parallel computers 
is essential for wide diffusion and efficient utilization of these novel 
architectures. High-level languages decrease both the design time and the 
execution time of parallel applications, and make it easier for new users to 
approach parallel computers. 


Q.18. Write languages name for parallel programming.
(R.G.P.V., June 2012)
Or
Give an example of parallel languages. (R.G.P.V., June 2015)

Ans. Parallel languages are developed either by introducing new languages
like Occam and Linda or by extending existing sequential languages like Fortran
90, Concurrent Pascal and C*. A new parallel programming language has the
benefit of using high-level parallel concepts or constructs for parallelism. High-
level parallel constructs were added to Fortran, Pascal, C and Lisp to make
them suitable for use on parallel computers.

Special optimizing compilers are used to automatically detect parallelism
and transform sequential constructs into parallel ones.

High-level parallel constructs can be implicitly embedded in the syntax or
explicitly specified by users. As shown in fig. 5.6, there are three compiler
approaches — preprocessors, precompilers and parallelizing compilers.
Preprocessors, such as MONMACS and FORCE, use compiler directives or
macroprocessors. Precompilers include automated Alliant FX Fortran
compilers, the Express C automatic parallelizer and semiautomated compilers
like PAT, DINO and MIMDizer.
Parallel Languages
New — Linda, SISAL, STRAND-88
Extended —
Preprocessor — FORCE, MONMACS, TOPSYS, OLYMPUS, PIE, FAULT, SCHEDULE,
Myrias, Hypertasking
Language Extension (Language Feature) — Fortran-90, POKER, Ada
Precompiler, Automated — FX Fortran, Express
Precompiler, Semiautomated — DINO, PAT, MIMDizer
Fig. 5.6


Q.19. Write short note on parallel compilers.
(R.G.P.V., Dec. 2010, June 2013)
Or
What is parallel compiler ? (R.G.P.V., Dec. 2015)

Ans. Generally speaking, parallel compilers decrease the execution time
of the program by breaking it up into blocks that may be processed
simultaneously by the multiple processing units. A parallelizing compiler has
three phases as shown in fig. 5.7.

Source Code → Flow Analysis → Program Optimizations → Parallel Code Generation
Shared-memory Multiprocessor — Partitioning, Load balancing,
Synchronization, .....
Distributed-memory Multicomputer — Distributed Data and
Computations, Message-passing, .....
Fig. 5.7 Compilation Phases in Parallel Code Generation

(i) Flow Analysis — This phase shows the program flow patterns
to find data and control dependences in the source code. The granularities of
parallelism to be exploited are quite different depending on the machine
structure. Thus the flow analysis is conducted at different execution levels on
different parallel computers.

(ii) Program Optimizations — This refers to the transformation of
user programs to explore the hardware capabilities as much as possible.
Transformation can be done at locality level, loop level, or prefetching level with
the aim of global optimization. The optimization converts a code into an equivalent
but better form. These transformations should be machine-dependent. Machine-
dependent transformations are meant to achieve more efficient allocation of
machine resources. The aim of program optimization is to enhance the speed of
code execution. This involves the minimization of code length and of memory
access and the exploitation of parallelism in programs. Other optimizations include
elimination of unnecessary branches or common expressions.

(iii) Parallel Code Generation — This involves transformation from
one representation to another known as intermediate form. Parallel code
generation is very different for different computer classes. Compiler directives
can be used to help generate parallel code when automated code generation
cannot be implemented easily.

Q.20. Classify the language features for parallelism.
Or
Write short note on features of parallel languages. (R.G.P.V., June 2010)
Or
Discuss the features of parallel language. (R.G.P.V., June 2013, 2016)
Or
Discuss features of parallelism. (R.G.P.V., Dec. 2014)
Or
Discuss about the language features for parallelism. (R.G.P.V., Dec. 2015)
Or
What are the language features for parallelism ? (R.G.P.V., June 2017)
Or
Describe the language features for parallelism. (R.G.P.V., Dec. 2017)

Ans. The language features for parallel programming have been classified
by Chang and Smith in 1990, into six categories on the basis of functionality.
For general-purpose applications, these features are idealized.
(i) Optimization Features — Optimization features are used for
program restructuring and compilation directives when converting sequentially
coded programs into parallel forms. These features favour code
parallelization and vectorization at compile time. The matching of the
software parallelism with the hardware parallelism in the target machine is
the main objective.

(a) Semiautomated Parallelizer — It requires compiler
directives or programmer's interaction, like DINO.


(b) Automated Parallelizer — The Express C automated 
parallelizer and the Alliant FX Fortran Compiler are examples. 

(c) Interactive Restructure Support — Run-time statistics, 
static analyzer, dataflow graph and code translator for restructuring Fortran 
code, like the MIMDizer from Pacific Sierra are examples. 

(ii) Availability Features — The availability features widen the 
application domains and make the languages machine-independent. These 
features improve the user-friendliness, make the language portable to a large 
class of parallel computers and increase the applicability of software libraries. 

(a) Portability — It specifies that there is portability of the 
language to message-passing multicomputers, shared-memory multiprocessors 
or both. 

(b) Compatibility — It specifies that there is compatibility of the
language with an established sequential language.


(c) Scalability — The language does not depend on hardware
topology and is scalable to the number of available processors.

Q.5. Explain synchronous and asynchronous message-passing model.
Or
Explain message passing model briefly. (R.G.P.V., Dec. 2010, 2015)
Or
Draw message-passing model. (R.G.P.V., June 2012)
Or
Write short note on message passing programming model.
(R.G.P.V., June 2014)

Ans. (i) Synchronous Message Passing — In the synchronous message
passing, the sender process and the receiver process must be synchronized in
time and space, just like a telephone call using circuit switched lines. Generally,
there is no need of buffers in the communication channels. For this reason,
synchronous communication is blocked by channels being busy or in error
because only one message is permitted to be sent through a channel at a time.

Apart from having a time connection, the sender and receiver must also
be linked by physical communication channels in space. There must be a path
of channels, which is ready to enable the message passing between them. In
other words, the sender and receiver must be coupled in both time and space
synchronously. If one process wants to communicate and the other does not,
then the one that wants to is blocked (or waits). For this reason, synchronous
communication is known as a blocking communication scheme.

(ii) Asynchronous Message Passing — In the asynchronous paradigm,
the passing of a message need not synchronize the sending and the receiving
process in time and space. Buffers are often used in channels. This results in
nonblocking message passing, provided adequately large buffers are used or
the network traffic is not saturated.

However, arbitrary communication delays may occur, because the sender
does not know whether the message has been received until an
acknowledgement arrives from the receiver. This approach is the same as a
postal service using mailboxes (i.e., channel buffers) having no
synchronization between senders and receivers.

Asynchronous message passing is nonblocking, so two
processes do not need to be synchronized either in time or in space. The
sender is permitted to send a message with no blocking, irrespective of whether
the receiver is ready or not.

In asynchronous communication, buffers are used to hold the messages
along the path of the connecting channels. The sender will eventually
be blocked because channel buffers are finite. Buffers are not required in a
synchronous multicomputer because only one message is permitted to pass
through a channel at a time. How to distribute or duplicate the program codes
and data sets over the processing nodes is the crucial issue in programming
this model. There exist tradeoffs between computation time and communication
overhead that must be considered.
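The difference between the two schemes can be sketched in Python with the standard `queue` and `threading` modules. The channel size, message names and variable names are assumptions made for illustration. In the first part a bounded queue plays the role of a finite channel buffer: sends return at once while there is room, and the third send finds the buffer full. In the second part the sender waits until the receiver has actually taken the message, which imitates blocking (synchronous) communication.

```python
import queue
import threading

# Asynchronous sends through a finite channel buffer (room for two messages).
channel = queue.Queue(maxsize=2)
sent = []
blocked_on = None
for message in ("m1", "m2", "m3"):
    try:
        channel.put_nowait(message)   # returns at once while the buffer has room
        sent.append(message)
    except queue.Full:
        blocked_on = message          # a blocking send would have to wait here
received = [channel.get_nowait(), channel.get_nowait()]

# Synchronous send: the sender waits until the receiver has the message.
rendezvous = queue.Queue(maxsize=1)
events = []


def receiver():
    message = rendezvous.get()
    events.append(("received", message))
    rendezvous.task_done()            # acknowledgement releases the sender


worker = threading.Thread(target=receiver)
worker.start()
rendezvous.put("hello")
rendezvous.join()                     # sender is blocked until acknowledged
events.append(("sender resumed", "hello"))
worker.join()
```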


Q.6. Explain shared variable and message passing model. 
(R.GP.V., June 2013, 2015) 
Or 
Explain shared variable model and message passing model in detail.
(R.G.P.V., Dec. 2017)


Ans. Refer to Q.2 and Q.5. 


Q.7. Explain array processing. (R.G.P.V., June 2016)

Ans. An array processor is a processor that performs computations on
large arrays of data. The term is used to refer to two different types of
processors. An attached array processor is an auxiliary processor attached to
a general-purpose computer. It is intended to improve the performance of the
host computer in specific numerical computation tasks. An SIMD array
processor is a processor that has a single instruction-multiple data organization.
It manipulates vector instructions by means of multiple functional units
responding to a common instruction. Both types of array processors
manipulate vectors, but they differ in their internal organization.

The SIMD form of parallel processing is called array processing. Fig. 5.2
illustrates array processors. A two-dimensional grid of processing elements
executes an instruction stream broadcast from a central control processor. As
each instruction is broadcast, all elements execute it simultaneously. Each
processing element is connected to its four nearest neighbours for purposes
of exchanging data. End-around connections may be provided on both rows
and columns, but they are not shown in fig. 5.2.


Control Processor → Broadcast Instructions → Grid of Processing Elements
Fig. 5.2
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Array processors are highly specialized machines. They are well suited to 
numerical problems that can be expressed in matrix or vector format. However, 


they are not very useful in speeding up general computations. 


Q.8. Explain the following terms in data-parallel model —
(i) Data parallelism
(ii) Array language extensions
(iii) Compiler support.
Or
Write short note on data parallel model. (R.G.P.V., Dec. 2010)
Or
Explain about data-parallel model. (R.G.P.V., Dec. 2014)

Ans. (i) Data Parallelism — Programming SIMD array processors has
been a challenge for computational scientists since the invention of the Illiac-
IV computer. In Illiac-IV, the key problem has been to match the problem size
with the fixed machine size. It means that large arrays or matrices must be
divided into 64-element segments before they can be processed by the 64
processing elements.

The SIMD computers, like the Connection Machine CM-2, provided bit- 
slice fine-grain data parallelism using 16,384 PEs simultaneously in a single- 
array configuration. This needs a lower degree of array segmentation. Therefore 
this provided greater flexibility in programming. 

Synchronous SIMD programming is different from asynchronous MIMD 
programming in such a manner that all PEs in an SIMD computer work in a 
lockstep manner, in contrast all processors in an MIMD computer run different 
instructions asynchronously. Consequently, there is no mutual exclusion or 
synchronization problems related with multiprocessors or multicomputers in 
SIMD computers. 

Hardware directly controls inter-PE communications. Apart from lockstep in
computing operations among all PEs, inter-PE data communication is also
conducted in lockstep fashion. Owing to these synchronized instruction
executions and data-routing operations, SIMD computers are rather good at
exploiting spatial parallelism in large arrays, grids, or meshes of data.

The control unit directly runs scalar instructions in an SIMD program.
The PEs are loaded with vector operands from local memories simultaneously
with the help of a global address having various offsets in local index registers.
Vector stores are run similarly. Constant data is sent to all PEs simultaneously.

All PEs can be disabled or enabled dynamically in any instruction cycle
by setting a masking pattern (i.e., binary vector) under program control.
Hardware directly assists masking instructions. An inter-PE routing network,
which is also controlled by program control on a dynamic basis, assists
data-routing vector operations.


(ii) Array Language Extensions — The representation of array
extensions in data-parallel languages is done by high-level data types. The
array syntax offers the elimination of some nested loops in the code. This
should reflect the architecture of the array processor. The DAP Fortran for
the AMT/Distributed Array Processor, CFD for the Illiac-IV, C* for the TMC
Connection Machine, and MPF for the MasPar family of parallel computers
are the examples of array processing languages. There should be a global
address space in an SIMD programming language. This eliminates the
requirement for explicit data routing between PEs. There should be ability in
the array extensions to make the number of PEs a function of the problem size
instead of a function of the target machine.

(iii) Compiler Support — The array language expressions and their
optimizing compilers are embedded in familiar standards like Fortran 77, Fortran
90 and C to assist data-parallel programming. The main ideas are the unification
of the program execution model, the incremental migration to data-parallel
execution and the precise control of massively parallel hardware.

Compiler-optimized control of SIMD machine hardware permits the
programmer to drive the PE array transparently. The compiler must separate
the program into scalar and parallel components and integrate it
with the UNIX environment.

Compiler technology must permit the array extensions to optimize
data placement, minimize data movement and virtualize the dimensions of the
PE array. The compiler produces data-parallel machine code to perform
operations on arrays.

Array sectioning permits a programmer to reference a section or
a region of a multidimensional array. An array section is designated
by specifying a start index, a stride, and a bound. Vector-valued subscripts
are used to construct arrays from arbitrary permutations of another array.
They are vectors that map the required elements into the target array. They
make it easy to implement gather and scatter operations on a vector of
indices.
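Array sectioning and vector-valued subscripts can be illustrated with ordinary Python lists. This is a sketch only: the data values and index vector are assumptions, and Python slice notation treats the bound as exclusive, whereas array languages such as Fortran 90 treat it as inclusive.

```python
data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

# Array section: start index 1, bound 8 (exclusive in Python), stride 3.
# It selects the elements at indices 1, 4 and 7.
section = data[1:8:3]

# Vector-valued subscript: a vector of indices selects (and may repeat) elements.
index = [7, 0, 3, 3]
gathered = [data[i] for i in index]        # gather: read data[index[k]]

# Scatter: write the gathered values back to the positions named by the indices.
target = [0] * len(data)
for i, value in zip(index, gathered):
    target[i] = value
```

On an SIMD machine each of these element-wise reads or writes would be carried out by a different PE in the same step, instead of by a loop.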


SIMD programs can be recompiled for MIMD architecture. The idea is
to develop a source-to-source precompiler to convert, for instance,
Connection Machine C* programs into C programs executing on an nCUBE
message-passing multicomputer in SPMD mode.

Clearly, SPMD programs are a special class of SIMD programs. They
support medium-grain parallelism and synchronization at the subprogram level
instead of at the instruction level. In this way both synchronous SIMD and
loosely coupled MIMD computers use the data-parallel programming model.

Q.9. What is data parallel architecture ? How it is important in
uniprocessor and multiprocessor architecture ? (R.G.P.V., June 2017)

Ans. The term parallel processing is an efficient form of processing data
and information simultaneously or in parallel. It examines the processor-level
parallelism in computers and focuses on the use of multiple CPUs to achieve
very high throughput and fault tolerance.

The performance of computers increases steadily due to advances in
hardware technologies and processor designs. A further way to raise
performance is to exploit processor-level parallelism, for example by building a
computer that contains a large number of processors working in parallel on
common tasks. Suppose that a computer P(n) is constructed by combining n
copies of a single computer P(1). If a task T can be partitioned into n subtasks
of similar complexity and P(n) can be programmed in such a way that its n
processors execute the n subtasks in parallel, we would expect P(n) to process
T about n times faster than P(1) can process it. One of the advantages of
processor-level parallelism is tolerance of hardware and software faults. A
sequential computer fails completely if its CPU fails, whereas a parallel
computer can be designed to continue functioning, perhaps at a reduced
performance level, in the presence of defective CPUs.

Fig. 5.3 One Dimensional Array of n Processors

Parallel processing plays an important role in uniprocessor and
multiprocessor computers.


In case of uniprocessor computer, the parallel processing mechanisms
are identified in the following six categories —

(i) Multiplicity of functional units.

(ii) Parallelism and pipelining within the CPU.

(iii) Overlapped CPU and I/O operations.

(iv) Use of a hierarchical memory system.

(v) Balancing of subsystem bandwidths.

(vi) Multiprogramming and time sharing.

In case of multiprocessor systems, parallel processing improves the
throughput, reliability, flexibility and availability. The multiprocessor system
contains two or more processors of approximately comparable capabilities.
Each processor has its own local memory and private devices. The
communications between the processors can be done through the shared
memories or through an interrupt network.

The different interconnections used in multiprocessor between the
memories and processors are —

(i) Time-shared common bus
(ii) Crossbar switch network
(iii) Multiport memories.

Q.10. Discuss the following terms in object-oriented model —
(i) Concurrent OOP
(ii) Parallelism in COOP
(iii) An actor model.
Or
Write short note on object-oriented model.
(R.G.P.V., Dec. 2010, June 2013)
Or
Write short note on object-oriented parallel programming model.
(R.G.P.V., June 2011)
Or
Explain object-oriented model. (R.G.P.V., June 2014)
Or
What is object-oriented model ? (R.G.P.V., Dec. 2015)

Ans. (i) Concurrent OOP — The popularity of object-oriented programming
(OOP) is attributed to three application demands. First, individual users
use interacting processes immensely, like the use of multiple windows. Second,
workstation networks are used as a cost-effective mechanism for resource
sharing and distributed problem solving. Third, there is advancement in the
multiprocessor technology to the point of offering supercomputing power at a
fraction of the traditional cost.

In fact, program abstraction results in program modularity and software
reusability as is found in OOP. Other areas that supported OOP are the
development of CAD tools and word processors with graphics capabilities.

An object is a program entity which encapsulates data and methods into
a single unit. A natural consequence of the concept of objects is concurrency.
The concurrent manipulation of objects in OOP is just like the concurrent use
of coroutines in conventional programming.

An alternative model for concurrent computing on multiprocessors or on
multicomputers is provided by the development of COOP. Object models are
different in the internal behaviour of objects and their interaction with each
other.

(ii) Parallelism in COOP — There are three common patterns of
parallelism in COOP —

(a) Pipeline concurrency is the overlapped enumeration of
successive solutions and concurrent testing of the solutions as they are
derived from an evaluation pipeline.

(b) Divide-and-conquer concurrency is the concurrent
elaboration of various subprograms and the combining of their solutions to
produce a solution to the overall problem. In this situation, no interaction exists
between the procedures solving the subproblems.

These two patterns are shown in fig. 5.4. Fig. 5.4 (a) shows a prime
number generation pipeline. In a linear pipeline of primes, integer numbers are
produced and successively tested for divisibility by previously produced primes.
The circled numbers indicate produced integers.

(b) Divide-and-conquer Concurrency
Fig. 5.4 Concurrency Types in Object-oriented Programming


A number enters the pipeline from the left end and is removed if it
is divisible by the prime number tested at a pipeline stage. The numbers
that reach a stage are therefore those that are indivisible by all the
prime numbers tested on the left of that stage.

The multiplication of a list of numbers (4, 3, 5, -7, 6, 2, 9, 8) using the
divide-and-conquer method is shown in fig. 5.4 (b). The leaves of the tree
express the numbers. The problem is recursively partitioned into subproblems
of multiplying two sublists. Each sublist is concurrently evaluated and the
results multiplied at the upper node.

(c) Cooperative problem solving is the third pattern. The
dynamic path evaluation of various physical bodies under the mutual influence
of gravitational fields is a simple example. In this situation, each object must
interact with the others. Objects store intermediate results and share them by
passing messages.


(iii) An Actor Model — An actor model was developed at MIT. It is a
framework for COOP. Actors are self-contained, independent, interactive
components of a computing system. They communicate by asynchronous
message passing. Message passing is joined with semantics in an actor model.
The following are the basic actor primitives —


(a) Create — From a given behaviour description and a set of 
parameters, an actor is created. 


(b) Send-to — A message is sent from one actor to another. 


(c) Become — The behaviour of an actor is replaced by a new 
behaviour. 

The behaviour replacement specifies state changes. One can aggregate 
changes and prevent unnecessary control-flow dependences using replacement 
mechanism. The visualization of concurrent computations is done in terms of 
simultaneous communication events, concurrent actor creations, and behaviour 
replacements. An actor (object) may modify its state, send new messages and
create new objects in response to a message.
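
The three primitives (create, send-to, become) can be illustrated with a small sketch. The classes `Actor` and `ActorSystem` and the behaviours below are hypothetical names invented for this example, and a single queue drained in one thread stands in for asynchronous message delivery.

```python
from collections import deque


class Actor:
    def __init__(self, system, behaviour, **params):
        self.system = system
        self.behaviour = behaviour     # function(actor, message)
        self.state = dict(params)

    def become(self, behaviour):
        """Become: replace the current behaviour by a new one."""
        self.behaviour = behaviour


class ActorSystem:
    def __init__(self):
        self.mailbox = deque()

    def create(self, behaviour, **params):
        """Create: build an actor from a behaviour description and parameters."""
        return Actor(self, behaviour, **params)

    def send_to(self, actor, message):
        """Send-to: queue the message; the sender does not wait for a reply."""
        self.mailbox.append((actor, message))

    def run(self):
        while self.mailbox:
            actor, message = self.mailbox.popleft()
            actor.behaviour(actor, message)


def counting(actor, message):
    if message == "inc":
        actor.state["count"] += 1
    elif message == "stop":
        actor.system.send_to(actor.state["report_to"], actor.state["count"])
        actor.become(stopped)          # state change expressed as replacement


def stopped(actor, message):
    pass                               # ignores every further message


def collecting(actor, message):
    actor.state["received"].append(message)


def demo():
    system = ActorSystem()
    collector = system.create(collecting, received=[])
    counter = system.create(counting, count=0, report_to=collector)
    for msg in ["inc", "inc", "stop", "inc"]:
        system.send_to(counter, msg)
    system.run()
    return collector.state["received"], counter.state["count"]
```

In `demo`, the last "inc" arrives after the counter has replaced its behaviour, so it no longer changes the count.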

Concurrency control structures are used to express specific patterns of 
message passing. A low-level description of concurrent systems is given by 
the actor primitives. High-level constructs are also required for increasing the 
granularity of descriptions and for encapsulating faults. The actor model is
specifically appropriate for multicomputer implementations.


Q.11. Explain functional and logical models of parallel computer.
(R.G.P.V., June 2010)
Or
Describe about function and logic models. (R.G.P.V., Dec. 2014)
Or
What is functional and logic models ? (R.G.P.V., June 2015)
Or
Write short note on functional and logic model. (R.G.P.V., Dec. 2017)
Ans. Functional Model — A functional programming language is concerned
with the functionality of a program. It should not generate side effects after
execution. No concept of storage, assignment and branching is available in
functional programs.

There is greater opportunity for parallelism in case of less side effects.
Precedence restrictions arise only as a result of function application. A function
evaluation gives the same result irrespective of the order of the evaluation of
its parameters. It specifies that all parameters in a dynamically created structure
of a functional program can be evaluated in parallel. The nature of all single-
assignment and dataflow languages is functional. It specifies that functional
programming models can be used in data-driven multiprocessors. The functional
model is referentially transparent. It gives emphasis on fine-grain MIMD
parallelism.

Most of the parallel computers designed to assist the functional model
were oriented toward Lisp. The execution of functional programs has
also been done by other dataflow computers.
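
The order-independence of parameter evaluation can be checked with a short sketch. The helper names `apply_parallel` and `apply_in_order` are assumptions for this example; the parameters are passed as zero-argument functions (thunks) so that the caller controls when each one is evaluated.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def cube(x):
    return x * x * x


def add(a, b):
    return a + b


def apply_parallel(func, *thunks):
    """Evaluate every parameter concurrently, then apply func to the results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(thunk) for thunk in thunks]
        args = [future.result() for future in futures]
    return func(*args)


def apply_in_order(func, thunks, order):
    """Evaluate the parameters sequentially in the given index order."""
    values = [None] * len(thunks)
    for i in order:
        values[i] = thunks[i]()
    return func(*values)
```

Because `square` and `cube` have no side effects, evaluating them left to right, right to left, or concurrently gives the same value for `add`.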


Logic Model — Logic programming depends on predicate logic. It is
appropriate for knowledge processing dealing with large databases. The
logic programming model uses an implicit search strategy. It also assists
parallelism in the logic inference process. A question is answered if there are
matching facts in the database. If the predicates and associated
parameters of two facts are the same then these facts are said to be matched.
There can be parallelization of the process of matching and unification under
specific conditions. In logic programming, the transformation of clauses is
done into dataflow graphs. Parallel unification has been tried on some dataflow
computers made in Japan.


Concurrent Prolog and Parlog are two parallel logic programming
languages. Both languages implement relational language features like AND-
parallel execution of conjunctive goals, IPC by shared variables and OR-
parallel reduction. The resolution tree in Parlog
contains one chain at AND levels, and OR levels are partially or fully
produced. The search strategy in Concurrent Prolog adopts multiple paths
or depth first. These logic programming systems may also avail stream
parallelism.


Q.12. Explain various parallel programming models.
(R.G.P.V., June 2011)
Or
Write short note on parallel programming models.
(R.G.P.V., Dec. 2016)


Ans. Refer to Q.1, Q.2, Q.5, Q.8, Q.10 and Q.11. 


Q.13. Explain three parallel architecture models and compare their
merits and demerits. (R.G.P.V., June 2015)


Ans. Three parallel architecture models are — 
(i) Shared Variable Model — Refer to Q.2. 


Merits — 


(a) In shared memory, the global address space is given for 
user-friendly programming approach to memory. 


(b) The data sharing among processes is fast and uniform due 
to the closeness of memory to CPU. 


(c) The communication of data among processes need not be
specified explicitly.
(d) The process communication overhead is negligible.
(e) It is very simple and easy to learn.
Demerits — 
(a) It is not portable. 
(b) It is very difficult to manage the data locality. 
(ii) Message Passing Model — Refer to Q.5. 
Merits — 
(a) It provides low-level control of parallelism. 
(b) It is portable. 
(c) Less error prone. 


(d) Less overhead in parallel synchronization and data
distribution.
Demerits —
(a) The message passing code generally needs more software
as compared to parallel shared memory code.
(b) Programming changes are required in the parallel algorithm.
(c) Sometimes it becomes difficult to debug.
(d) It does not perform well over the communication network
between the nodes.
(iii) Object-oriented Model — Refer to Q.10.
Merits —
(a) The object-oriented approach is used to reduce the
maintenance of the system.
(b) It is used in real world modeling.
(c) It improves the reliability and flexibility of the system.
(d) It helps in high code reusability.


Demerits — 
(a) There is a shortage of experienced programmers, and the
approach is unfamiliar to many.
(b) Limited consensus on the standards to use.
(c) Its efficiency is low while dealing with simple data.
(d) Proper tools and support are limited. 


(e) It is difficult to implement purely. 


Q.14. What do you mean by tuple space model of parallel programming ?
Write a Linda program for any task graph you assume.
(R.G.P.V., June 2017)
Ans. A tuple space is an implementation of the associative memory paradigm
for parallel or distributed computing. It provides a repository of tuples that
can be accessed concurrently. For example, consider a group of processors
that produces data and a group of processors that uses that data. Producers
post their data as tuples in the space, and the consumers then retrieve the data
from the space that match a certain pattern. This is also known as the blackboard
metaphor. Tuple space may be thought of as a form of distributed shared memory.
It behaves like a blackboard or bulletin board where anyone (process) can add a
notice (tuple) or remove a notice. The tuple space then becomes a global
mailbox, i.e. its mail (tuple) can be accessed by any process for any purpose
(receiving, sending or broadcasting messages).

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Linda Program — 
begin 
global a, b, c, d, e, f; 
Read a, b, c, d; 
eval (“Proc 1”, cos (f(a)**3)); 
eval (“Proc 2”, cos (g(b))); 
eval (“Proc 3”, exp (h(c)), sin (exp(h(c)))); 
eval (“Proc 4”, P(d)); 
rd (“Proc 1”, ?q); 
rd (“Proc 2”, ?r); 
rd (“Proc 3”, ?s, ?t); 
rd (“Proc 4”, ?w);
e ← w*s + t;
f ← q + r*s;
write e, f 
end 
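
The eval and rd steps of such a program can be mimicked with a small in-memory tuple space. This is a sketch under stated assumptions, not the Linda runtime: the class `TupleSpace`, its method names, the use of `None` as the `?` formal, and the placeholder functions `f`, `g`, `h` and `P` (all defaulting to `abs`) are invented for illustration.

```python
import math
import threading


class TupleSpace:
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *tup):
        """Post a tuple and wake up any process waiting for a match."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def eval(self, tag, *thunks):
        """Spawn a worker that evaluates the fields, then posts (tag, results...)."""
        worker = threading.Thread(
            target=lambda: self.out(tag, *(thunk() for thunk in thunks)))
        worker.start()
        return worker

    @staticmethod
    def _match(pattern, tup):
        return len(pattern) == len(tup) and all(
            p is None or p == v for p, v in zip(pattern, tup))

    def _find(self, pattern, remove):
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._match(pattern, tup):
                        if remove:
                            self._tuples.remove(tup)
                        return tup
                self._cond.wait()      # block until another tuple is posted

    def rd(self, *pattern):
        """Blocking read that leaves the tuple in the space."""
        return self._find(pattern, remove=False)

    def in_(self, *pattern):
        """Blocking read that removes the tuple from the space."""
        return self._find(pattern, remove=True)


def run_task_graph(a, b, c, d, f=abs, g=abs, h=abs, P=abs):
    ts = TupleSpace()
    ts.eval("Proc 1", lambda: math.cos(f(a) ** 3))
    ts.eval("Proc 2", lambda: math.cos(g(b)))
    ts.eval("Proc 3", lambda: math.exp(h(c)), lambda: math.sin(math.exp(h(c))))
    ts.eval("Proc 4", lambda: P(d))
    _, q = ts.rd("Proc 1", None)
    _, r = ts.rd("Proc 2", None)
    _, s, t = ts.rd("Proc 3", None, None)
    _, w = ts.rd("Proc 4", None)
    return q, r, s, t, w
```

The four `eval` calls may finish in any order; each `rd` simply blocks until the tuple with the matching tag has been posted.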


Q.15. What are CRCW and CREW ? (R.G.P.V., June 2015)


Ans. The CRCW (Concurrent Read, Concurrent Write) PRAM (Parallel
Random Access Machine) allows for both concurrent reads and concurrent
writes. When we use such a model, the details of the concurrent write must
be specified. The CREW (Concurrent Read, Exclusive Write) PRAM is one
of the most popular models because it is intuitively appealing to assume that
concurrent reads may occur, but concurrent writes may not occur.
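
The difference between the two models shows up when several processors write to one memory cell in the same step. The sketch below is illustrative only: the function `pram_write_step` is hypothetical, and the three conflict policies (common, arbitrary, priority) are common textbook conventions rather than something specified in the answer above.

```python
def pram_write_step(memory, writes, model="CREW", policy="priority"):
    """Apply one synchronous write step.

    writes is a list of (processor_id, address, value) issued in the same step.
    """
    by_address = {}
    for pid, addr, value in writes:
        by_address.setdefault(addr, []).append((pid, value))

    # Validate every cell first so that a rejected step leaves memory unchanged.
    for addr, requests in by_address.items():
        if len(requests) > 1:
            if model == "CREW":
                raise ValueError(f"concurrent write to cell {addr} under CREW")
            if policy == "common" and len({v for _, v in requests}) != 1:
                raise ValueError(f"writers to cell {addr} disagree")
            if policy not in ("common", "arbitrary", "priority"):
                raise ValueError(f"unknown policy {policy!r}")

    for addr, requests in by_address.items():
        if len(requests) > 1 and policy == "priority":
            # Assumed convention: the lowest processor id wins.
            memory[addr] = min(requests, key=lambda req: req[0])[1]
        else:
            # Single writer, common value, or arbitrary choice (first listed).
            memory[addr] = requests[0][1]
    return memory
```

With the same set of writes, a CRCW step resolves the conflict according to the chosen policy, while a CREW step rejects it.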


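Q.16. State and prove Amdahl's law. (R.G.P.V., June 2016)


Ans. Amdahl's law (1967) is based on a fixed workload or a fixed problem
size. A weighted harmonic mean speedup is represented by

$$S = \frac{T_1}{T^*} = \frac{1}{\sum_{i=1}^{n} f_i / R_i} \qquad \ldots(i)$$

Let $R_i = i$ and $w = (f_1, f_2, \ldots, f_n) = (\alpha, 0, 0, \ldots, 0, 1-\alpha)$. This means $f_1 = \alpha$, $f_n = 1-\alpha$
and $f_i = 0$ for $i \neq 1$ and $i \neq n$. It means the system is used either in a fully
parallel mode using n processors with a probability $1-\alpha$, or in a pure sequential
mode on one processor with a probability $\alpha$. By putting $R_1 = 1$, $R_n = n$ and
$w$ into equation (i), we get

$$S_n = \frac{1}{\alpha/1 + (1-\alpha)/n} = \frac{n}{n\alpha + 1 - \alpha} = \frac{n}{1 + (n-1)\alpha} \qquad \ldots(ii)$$

This is known as Amdahl's law. The implication is that $S_n \to 1/\alpha$ as $n \to \infty$.
In other words, the best speedup one can expect is upper-bounded by $1/\alpha$ under the
above probability assumption. Equation (ii) is plotted as a function of n for four values
of $\alpha$ in fig. 5.5. The speedup decreases sharply as the value of $\alpha$ increases from
0.01 to 0.9.

[Plot of speedup $S_n$ against the number of processors n (n = 4, 16, 64, 256, 1024) for four values of $\alpha$]
Fig. 5.5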
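
The behaviour of the speedup formula can be tabulated with a few lines of code. The function name `amdahl_speedup` is an assumption for this sketch, and the four values of α used in the table (0, 0.01, 0.1, 0.9) are assumed for illustration; the text only states the range 0.01 to 0.9.

```python
def amdahl_speedup(n, alpha):
    """Speedup n / (1 + (n - 1) * alpha) on n processors.

    alpha is the probability (fraction) of purely sequential execution.
    """
    return n / (1 + (n - 1) * alpha)


def speedup_table(ns=(4, 16, 64, 256, 1024), alphas=(0.0, 0.01, 0.1, 0.9)):
    """Return {alpha: [speedup for each n]} for the chosen processor counts."""
    return {alpha: [amdahl_speedup(n, alpha) for n in ns] for alpha in alphas}


if __name__ == "__main__":
    for alpha, row in speedup_table().items():
        print(alpha, [round(s, 2) for s in row])
```

For any α greater than zero the values in a row level off below 1/α, however large n becomes.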


PARALLEL LANGUAGES AND COMPILERS, LANGUAGE
FEATURES FOR PARALLELISM, PARALLEL PROGRAMMING
ENVIRONMENT, SOFTWARE TOOLS AND ENVIRONMENTS


Q.17. Write short note on parallel programming languages.


Ans. Parallel programming languages are languages designed to program
algorithms and applications on parallel computers. Parallel processing is a
great opportunity for developing high performance systems and solving large
problems in many application areas. During the last few years parallel
computers ranging from tens to thousands of computing elements became
commercially available. They continue to gain recognition as powerful tools
in scientific research, information management, and engineering applications.
This trend is driven by parallel programming languages and tools that
contribute to make parallel computers useful in supporting a broad range of
applications. Many models and languages have been designed and implemented
to allow the design and development of applications on parallel computers.
Parallel programming languages (called also concurrent languages) allow
the design of parallel algorithms as a set of concurrent actions mapped onto
different computing elements. The cooperation between two or more actions
can be performed in many ways according to the selected language. The
design of programming languages and software tools for parallel computers
is essential for the wide diffusion and efficient utilization of these novel
architectures. High-level languages decrease both the design time and the
execution time of parallel applications, and make it easier for new users to
approach parallel computers.

Q.18. Write languages name for parallel programming.
(R.GPY, J 
Or ‘me 29] Y 
Give an example of parallel languages. (R.G.P.V., June 2015)
Ans. Parallel languages are developed either by introducing new languages
like Occam and Linda or by extending existing sequential languages like Fortran
90, Concurrent Pascal and C*. A new parallel programming language has the
benefit of using high-level parallel concepts or constructs for parallelism. High-
level parallel constructs were added to Fortran, Pascal, C and Lisp to make
them suitable for use on parallel computers.

Special optimizing compilers are used to automatically detect parallelism
and transform sequential constructs into parallel ones.

High-level parallel constructs can be implicitly embedded in the syntax or
explicitly specified by users. As shown in fig. 5.6, there are three compiler
approaches — preprocessors, precompilers and parallelizing compilers.
Preprocessors, such as MONMACS and FORCE, use compiler directives or
macroprocessors. Precompilers include automated Alliant FX Fortran
compilers, the Express C automatic parallelizer and semiautomated compilers
like PAT, DINO and MIMDizer.


[Tree diagram: Parallel Languages divide into New Language (Linda, SISAL, STRAND-88) and Extended. The extended branch covers Language Extension by Preprocessor (FORCE, MONMACS, TOPSYS, OLYMPUS, POKER, PIE, FAUST, SCHEDULE, Myrias, Hypertasking) or Language Feature (Fortran-90, Ada), and Precompiler, either Automated (FX Fortran, Express) or Semiautomated (DINO, PAT, MIMDizer)]
Fig. 5.6


Q.19. Write short note on parallel compilers.
(R.G.P.V., Dec. 2010, June 2013)
Or
What is parallel compiler ? (R.G.P.V., Dec. 2015)


Ans. Generally speaking, parallel compilers decrease the execution time
of the program by breaking it up into blocks that may be processed
simultaneously by the multiple processing units. A parallelizing compiler has
three phases as shown in fig. 5.7.

[Flow diagram: Source Code, then Flow Analysis, then Program Optimizations, then Parallel Code Generation. The generated code targets a superscalar processor, a shared-memory multiprocessor (partitioning, load balancing, synchronization, ...) or a distributed-memory multicomputer (distributed data and computations, message passing, ...)]
Fig. 5.7 Compilation Phases in Parallel Code Generation


(i) Flow Analysis — This phase reveals the program flow patterns
to find data and control dependences in the source code. The granularities of
parallelism to be exploited are quite different depending on the machine
structure. Thus the flow analysis is conducted at different execution levels on
different parallel computers.

(ii) Program Optimizations — This refers to the transformation of 
user programs to explore the hardware capabilities as much as possible. 
Transformation can be done at locality level, loop level, or prefetching level with 
the aim of global optimization. The optimization converts a code into an equivalent 
but better form. These transformations should be machine-dependent. Machine- 
dependent transformations are meant to achieve more efficient allocation of 
machine resources. The aim of program optimization is to enhance the speed of 
code execution. This involves the minimization of code length and of memory 
access and the exploitation of parallelism in programs. Other optimizations include 
elimination of unnecessary branches or common expressions. 


(iii) Parallel Code Generation — This involves transformation from 
one representation to another known as intermediate form. Parallel code 
generation is very different for different computer classes. Compiler directives 


can be used to help generate parallel code when automated code generation 
cannot be implemented easily. 


Q.20. Classify the language features for parallelism.
Or 

Write short note on features of parallel languages. (R.GPV., June 2010) 
Or 
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Discuss features of parallelism. 
Or . 2014, 
Discuss about the language features for parallelism. 
(R.GPV, Doo 
Or » Dec, 24 ly 


What are the language features for parallelism ? (R.G.P.V., June 2017)
Or
Describe the language features for parallelism. (R.G.P.V., Dec. 201…)


Ans. The language features for parallel programming have been classified
by Chang and Smith in 1990 into six categories on the basis of functionality.
For general-purpose applications, these features are idealized.


(i) Optimization Features — Optimization features are used for
program restructuring and compilation directives in converting sequentially
coded programs into parallel forms. These features favour
parallelization and vectorization at compile time. The matching of the hardware
parallelism with the software parallelism in the target machine is the main
objective.

(a) Semiautomated Parallelizer — It requires compiler 
directives or programmer’s interaction, like DINO. 

(b) Automated Parallelizer — The Express C automated 
parallelizer and the Alliant FX Fortran Compiler are examples. 


(c) Interactive Restructure Support — Run-time statistics, 
static analyzer, dataflow graph and code translator for restructuring Fortran 
code, like the MIMDizer from Pacific Sierra are examples. 


(ii) Availability Features — The availability features widen the
application domains and make the languages machine-independent. These
features improve the user-friendliness, make the language portable to a large
class of parallel computers and increase the applicability of software libraries.

(a) Portability — It specifies that there is portability of the
language to message-passing multicomputers, shared-memory multiprocessors
or both.

(b) Compatibility — It specifies that there is compatibility of the
language with an established sequential language.

(c) Scalability — The language does not depend on hardware
topology and is scalable to the number of available processors.


ified 
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semaphores, monitors 


(h) Dataflow languages. 
(iv) Control of Parallelism — The control parallelism features usually
involve trade-offs among grain size, memory demand, and communication and
synchronization overhead. The features involving control constructs to specify
parallelism in different forms are —
(a) Coarse grain, medium grain and fine grain 
(b) Global parallelism in the whole program 
(c) Implicit versus explicit parallelism 
(d) Loop parallelism 
(e) Queue of shared task 
(f) Job-split parallelism 
(g) Divide-and-conquer technique 
(h) Specification of job dependency 
(i) Types of shared abstract data. 

(v) Data Parallelism Features — The fine-grain computations on
SIMD machines and medium grain on MIMD computers are achieved by data
parallelism. Data parallelism features are utilized for specifying data access
and distribution in SIMD and MIMD computers.

(a) Run-time Automatic Decomposition — There is no user
intervention in distributing data automatically. For example, Express.

(b) Virtual Processor Support — The mapping of the virtual
processors is done dynamically or statically onto the physical processors. For
example, PISCES 2 and DINO.

(c) Mapping Specification — A facility is given for users to
specify communication patterns or mapping of data and processes onto the
hardware. For example, DINO.

(d) Direct Access to Shared Data — The access to shared data
is done directly without monitor control. For example, Linda.

(e) SPMD Support — SPMD programming. For example,
Hypertasking.

(vi) Process Management Features — Process management features
are required to assist the creation of parallel processes, multithreading or
multitasking implementation, program decomposition and replication, and
dynamic load balancing at run time.

(a) Run time creation of dynamic process.

(b) Replicated Workers — It means that there is similar program
on each node having different data (SPMD mode).

(c) Lightweight Processes (Threads) — Compared to
heavyweight processes in UNIX.

(d) Automatic Load Balancing — The migration of workload
is done dynamically among busy and idle nodes to get the similar amount of
work at different processor nodes.

(e) Partitioned Networks — There might be more than one process
on each processor node and all processor nodes might execute various processes.
Q.21. Write in brief on parallel languages and explain the features 
parallel language for parallelism. (R.GPYV. ey ; 
Or " 
Discuss about parallel languages. Also write its features. 
RGR. , 
Ans. Refer to Q.17 and Q.20.


Q.22. What are the features of control of parallelism ?
(R.G.P.V., Dec. 2016)


Ans. Refer to ans. of Q.20 (iv).
Q.23. Define the term loop skewing. (R.G.P.V., Dec. 2016)
Ans. Loop skewing is a loop transformation method which rearranges the
execution order of loop and converts it into skew execution. It skews the inner
loop with respect to outer loop. The factor of transformation is 1 and its
transformation matrix is

$$\begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$$

For example

Simple Loop
Do j = 1, M
  Do k = 1, M
    X(j, k) = X(j, k - 1) + X(j - 1, k)
  EndDo
EndDo

Transformed Loop
Do j = 1, M
  Do k = j + 1, j + M
    X(j, k - j) = X(j, k - j - 1) + X(j - 1, k - j)
  EndDo
EndDo
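
A short check that skewing with factor 1 preserves the result is sketched below, taking the skewed inner loop to run over k' = k + j, that is from j + 1 to j + M. The function names and the boundary values of 1 in row 0 and column 0 are assumptions for the example. The third function visits the skewed iteration space one wavefront (constant k') at a time, to show that iterations on the same wavefront do not depend on each other.

```python
def make_array(M):
    """(M+1) x (M+1) array; row 0 and column 0 hold assumed boundary values of 1."""
    return [[1] * (M + 1) for _ in range(M + 1)]


def simple_loop(M):
    X = make_array(M)
    for j in range(1, M + 1):
        for k in range(1, M + 1):
            X[j][k] = X[j][k - 1] + X[j - 1][k]
    return X


def skewed_loop(M):
    X = make_array(M)
    for j in range(1, M + 1):
        for k in range(j + 1, j + M + 1):      # skewed index k' = k + j
            X[j][k - j] = X[j][k - j - 1] + X[j - 1][k - j]
    return X


def wavefront_loop(M):
    X = make_array(M)
    for kp in range(2, 2 * M + 1):             # one wavefront per value of k'
        rows = range(max(1, kp - M), min(M, kp - 1) + 1)
        for j in reversed(rows):               # any order works inside a wavefront
            k = kp - j
            X[j][k] = X[j][k - 1] + X[j - 1][k]
    return X
```

All three functions fill the array with the same values, which is what makes the inner wavefront loop a candidate for parallel execution.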


Q.24. Explain the following loop transformations and discuss how to
apply them for loop vectorization or parallelization —
(i) Loop permutation (ii) Loop reversal
(iii) Locality optimization (iv) Software pipelining.
(R.G.P.V., Dec. 2016)

Ans. (i) Loop Permutation — Loop permutation transforms the execution
order of loop from $(p_1, \ldots, p_n)$ to $(p_{\sigma_1}, \ldots, p_{\sigma_n})$, where $\sigma$ is called permutation.
$I_\sigma$ is the permuted identity matrix which transforms (j, k) iteration to (k, j) iteration.
It can be written as

$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} j \\ k \end{bmatrix} = \begin{bmatrix} k \\ j \end{bmatrix}$$

For example

Simple loop
Do j = 1, N
  Do k = 1, N
    Y(k) = Y(k) + Z(j, k)
  EndDo
EndDo

Permuted loop
Do k = 1, N
  Do j = 1, N
    Y(k) = Y(k) + Z(j, k)
  EndDo
EndDo

[Iteration-space diagrams with axes Increase (j) and Increase (k): (a) Simple Loop (b) Permuted Loop]
Fig. 5.8
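
The two loop orders can be compared directly. In this sketch the names `column_sums_jk` and `column_sums_kj` are hypothetical, and indices start at 0 instead of 1 as is usual in Python.

```python
def column_sums_jk(Z):
    """Simple loop: j outer, k inner."""
    N = len(Z)
    Y = [0] * N
    for j in range(N):
        for k in range(N):
            Y[k] = Y[k] + Z[j][k]
    return Y


def column_sums_kj(Z):
    """Permuted loop: k outer, j inner."""
    N = len(Z)
    Y = [0] * N
    for k in range(N):
        for j in range(N):
            Y[k] = Y[k] + Z[j][k]
    return Y
```

After permutation each iteration of the outer k loop touches only its own element Y(k), so the outer iterations are independent of one another.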
(ii) Loop Reversal — Loop reversal reverses the execution order of
loop and identity matrix after loop reversal will be

$$\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

So,

$$\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} j \\ k \end{bmatrix} = \begin{bmatrix} j \\ -k \end{bmatrix}$$

For example

Simple loop
Do j = 1, N
  Do k = 1, N
    Y(j, k) = X(j - 1, k + 1)
  EndDo
EndDo

Reversed loop
Do j = 1, N
  Do k = -N, -1
    Y(j, -k) = X(j - 1, -k + 1)
  EndDo
EndDo

[Iteration-space diagrams with axes Increase (j) and Increase (k)]
(a) Simple Loop for Y (b) Loop after Reversal
Fig. 5.9


(iii) Locality Optimization — Locality optimization is used to reduce
memory access penalties. Tiling technique is used for locality optimization.
Tiling converts n-deep loop into 2n-deep loop.

Let us take an example of matrix multiplication.

Simple matrix multiplication
Do i = 1, N
  Do j = 1, N
    Do k = 1, N
      C(i, k) = C(i, k) + A(i, j) × B(j, k)
    EndDo
  EndDo
EndDo

Tiled matrix multiplication
Do g = 1, N, s
  Do h = 1, N, s
    Do i = 1, N
      Do j = g, min(g + s - 1, N)
        Do k = h, min(h + s - 1, N)
          C(i, k) = C(i, k) + A(i, j) × B(j, k)
        EndDo
      EndDo
    EndDo
  EndDo
EndDo

The number of intervening iterations, and hence the data fetched between
reuses of the same data, is reduced in this tiled matrix multiplication, which
keeps the reused data available in cache memory. It reduces memory access.

(iv) Software Pipelining — It is used to execute successive iterations
of loop of source program in pipelined manner.

(a) Execution without pipeline
Since it requires 4 cycles for 1 iteration, for N iterations,
cycles required = 4N

(b) Execution with pipeline

Cycle   Iteration 1    Iteration 2    Iteration 3
1       Read A[1]
2       MUL B, C       Read A[2]
3       Add A[1], B    MUL B, C       Read A[3]
4       Write A[1]     Add A[2], B    MUL B, C
5                      Write A[2]     Add A[3], B
6                                     Write A[3]

Hence, for N iterations it requires N + 3 cycles for execution.

Since, Speed-up = Cycle required (without pipeline) / Cycle required (with pipeline)
Thus, speed-up factor for 3 iterations 12/6 = 2 is achieved
and for N iterations 4N/(N + 3) is achieved.
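
The cycle counts used in the software pipelining example can be reproduced with a small schedule builder. This sketch assumes, as the example does, that every operation takes one cycle and that a new iteration can start every cycle; the names `STAGES`, `schedule` and `speedup` are invented for illustration.

```python
# The four operations of one loop iteration, in issue order.
STAGES = ["Read A[{i}]", "MUL B, C", "Add A[{i}], B", "Write A[{i}]"]


def cycles_without_pipeline(n):
    """Each iteration runs to completion before the next one starts."""
    return len(STAGES) * n


def cycles_with_pipeline(n):
    """A new iteration starts every cycle, so only the first one pays the full depth."""
    return n + len(STAGES) - 1 if n > 0 else 0


def schedule(n):
    """Return a list of cycles; each cycle lists the operations issued in it."""
    table = [[] for _ in range(cycles_with_pipeline(n))]
    for it in range(1, n + 1):
        for stage, op in enumerate(STAGES):
            table[it - 1 + stage].append(op.format(i=it))
    return table


def speedup(n):
    return cycles_without_pipeline(n) / cycles_with_pipeline(n)
```

As n grows, `speedup(n)` approaches 4, the number of operations in one iteration.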


Q.25. What is parallel programming environment ?


Ans. A parallel programming environment consists of hardware platforms,
languages supported, OS and software tools, and application packages.


Q.26. What is an ideal situation for a user in programming parallel
computer ? Give a suitable example.
(R.G.P.V., Dec. 2016)


Ans. An ideal situation for a user in programming parallel computer would
be —

(i) His needs are specified by him using a specification language.

(ii) Logic correctness of specification is confirmed by a system program.


k m E o 
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„_ (F) One more system Program picks an aleon 


specification into machine instructions. Select; gorithm and ¢ 


or contro] elements (CEs), the instru 
which minimises the execution time, 


d"y f daly 
dt” dt? +.--#BG(y,t) 

In this precision of solution, functions [g0 t) ( 

f(t)] and initial conditions are specified. Also, some ao * is fics PNAN) 
natural freguencies, stability problems, ete. might be ea Me 
system program confirms the consistency and completeness of ae ie 
given by the user. More knowledge can be obtained from on ues 
The expert system which have experience to solve multiple types dima 
equations selects the most appropriate algorithm to solve the potion 
best allocation task to processing elements or control elements is also bah bg 
by expert system and with error bound and other relevant data out ut ane 


All these ideal situations ensure the availability of qualitative specification
techniques, expert’s involvement in problem solving etc.


Q.27. Discuss software tool types for parallel programming.
Or
Write short note on parallel software tools. (R.G.P.V., June 2011)


Ans. Software tools are classified along a spectrum between pure
languages and integrated environments, as shown in fig. 5.10. Integrated
environment is composed of an editor, a debugger, performance monitors and
a program visualizer to enhance software productivity and the quality of
application programs, like the packages Express and TOPSYS. There can be
partitioning of integrated environments into different classes based
on the maturity of tool sets.

(i) A program tracing facility is given by a basic environment for
performance monitoring and debugging or a graphic mechanism to represent
the task dependence graph in SCHEDULE, the process component graph in
PIE, and the process call graph in FAUST.

(ii) The tools are offered by limited integration for parallel debugging,
performance monitoring or program visualization beyond the
capability of the basic environments.

(iii) The tools are offered by well-developed environments for
debugging programs, interaction of textual/graphical representations of a parallel
program, visualization assistance for performance monitoring, program
visualization, parallel graphics and I/O, etc.

The classification of a certain tool depends on time. For example, Linda
may become an integrated environment someday if more tools are developed
in it. For instance, C-Linda and Fortran-Linda are used to write parallel
programs with C and Fortran.

[Diagram: software tool types range from pure parallel languages to integrated environments, the latter divided into basic, limited and well-developed environments]
Fig. 5.10 Software Tool Types for Parallel Programming
Q.28. Write the features of parallel languages. Name the software tools
which support parallelism. (R.G.P.V., Dec. 2010)


Ans. Refer to Q.20 and Q.27.


Q.29. Give some important environment features.
Or
Write features of parallel programming environment.
(R.G.P.V., June 2014)


Ans. The important environment features are given below —

(i) Control-flow graph creation.
(ii) At the source code level, there is a parallel debugger.
(iii) Integrated graphical/textual map.
(iv) Model of performance prediction.
(v) Performance monitoring using software or hardware.
(vi) Parallel I/O for fast movement of data.
(vii) Program visualizer to depict program structures and data flow
patterns.
(viii) OS assistance for parallelism in front-end or back-end
environments.
(ix) Visualization assistance for guidance for parallel computations
and development of programs.
(x) Communication assistance in a network environment.


Q.30. Write short note on software tools and environment.
(R.G.P.V., June 2010, 2013)
Or
Discuss about the software tools and environment.


Ans. Refer to Q.25, Q.27 and Q.29.


Q.31. Write short note on parallel programming environment.
(R.G.P.V., June 2011)
Or
Explain about parallel programming environment and tools.
(R.G.P.V., Dec. 2014)


Ans. Refer to Q.25, Q.27 and Q.29.


Q.32. Write short note on parallel algorithm for quicksort.
(R.G.P.V., Dec. 2010)


Ans.
Step (i) — Start
Step (ii) — Quicksort ([X | XS], YS) ←
Partition (XS?, X, Smaller, Larger),
Step (iii) — Quicksort (Smaller?, SS),
Step (iv) — Quicksort (Larger?, LS),
Step (v) — Append (SS?, [X | LS?], YS).
Step (vi) — Quicksort ([ ], [ ]).
Step (vii) — Partition ([Y | In], X, [Y | Smaller], Larger) ←
X ≥ Y | Partition (In?, X, Smaller, Larger).
Step (viii) — Partition ([Y | In], X, Smaller, [Y | Larger]) ←
X < Y | Partition (In?, X, Smaller, Larger).
Step (ix) — Partition ([ ], X, [ ], [ ]).
Step (x) — Append ([X | XS], YS, [X | ZS]) ←
Append (XS?, YS, ZS).
Step (xi) — Append ([ ], XS, XS).
Step (xii) — End
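
The same partition-and-append structure can be sketched in Python. This is an illustrative translation, not the logic-language program itself: the `max_parallel_depth` cut-off is an assumption, and the thread only mirrors the fact that the two recursive Quicksort goals can proceed concurrently (in CPython it will not by itself make the sort faster).

```python
import threading


def partition(items, pivot):
    """Split items into those with pivot >= y (Smaller) and pivot < y (Larger)."""
    smaller = [y for y in items if pivot >= y]
    larger = [y for y in items if pivot < y]
    return smaller, larger


def quicksort(items, max_parallel_depth=3, _depth=0):
    if not items:
        return []                                  # Quicksort([ ], [ ])
    pivot, rest = items[0], items[1:]
    smaller, larger = partition(rest, pivot)
    if _depth < max_parallel_depth:
        box = {}

        def sort_smaller():
            box["ss"] = quicksort(smaller, max_parallel_depth, _depth + 1)

        worker = threading.Thread(target=sort_smaller)
        worker.start()                             # sort Smaller concurrently
        ls = quicksort(larger, max_parallel_depth, _depth + 1)
        worker.join()
        ss = box["ss"]
    else:
        ss = quicksort(smaller, max_parallel_depth, _depth + 1)
        ls = quicksort(larger, max_parallel_depth, _depth + 1)
    return ss + [pivot] + ls                       # Append(SS, [X | LS], YS)
```

For example, `quicksort([5, 3, 8, 1, 9, 2, 5])` partitions around 5 and sorts the two halves independently before appending them.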


38 E 36 


Note : 


1. 


2. (a) Differentiate between the following — 


` ‘, De 10 
i mester) EXAMINATION, Dec., 20 
BE. GDN SS (New Scheme) aa 
ter Science & Engg. ranc 
ADVANCE COMPUTER ARCHITECTURE 
[CS—605(N)] 


one question from each Unit. 
Unit-I . 
‘der the execution of an object code with 200000 instructions 
crt rocessor. The program consists of four major types 
cee : ui P The instruction mix and the number of cycles (CPI) 
2 ee instruction type are given below based on the result 
net program trace experiment — 10 


aE tastruction Type | CPI | Instruction Mix 
60% 


Arithmetic & Logic 


Attempt 


(a) Co 


0 
Load/Store with Cache hit 
Branch rae 


Memory reference with cache miss 


(See Unit-I, Page 19, Prob.1) 
(b) Compare and comment on static interconnection networks in terms 


Calculate average CPI and MIPS. 


i isecti idth. 10 

de degree, network diameter and bisection wi 
orno f (See Unit-I, Page 34, Q.34) 
10 


(i) COMA and NUMA (See Unit-I, Page 15, Q.11) 
i i tion architecture. 
ii) Binary tree and fat tree interconnec 
s Vis (See Unit-I, Page 39, Q.35) 


(b) Analyse the data dependency among the following statements in a
given program —
S1 : Load R1, 1024
S2 : Load R2, M(10)
S3 : Add R1, R2
S4 : Store M(1024), R1
S5 : Store M((R2)), 1024
(i) Draw a dependence graph to show all dependencies.
(ii) Are there any resource dependencies, if only one copy of each
functional unit is available in the CPU ?
(See Unit-I, Page 33, Prob.5)


(1) 


pS 
t 


a oy eS 


menegorreces iy 


Advance Computer Architecture 
3 a AET . f Unit-II 
a FAUN 5 me ibn p Toperty and mem 
(b D R tlevel memory hierarhcy. ory Cohrence Teon; 
) Describe the Addressi (See Unit-q Quire, 
syste ing and Timing > Page gs a, ts 
4 JA S Protocols of Bag C2010 
- Explain the structures and o . (See Unit-1y, p Ckplane bus 
pipelines used in CISC Perational requirements . Sge 78, Q35) 19 
processors. > Scalar RISC, Superscalar k MStruction 
(See Unit-IN, Page jot VLIW 
5. (a) What i Unit-H1 Page 106, Q147 
: at 1s superscalar pipelin va 


(b) 
7. (a) 
(b) 
8. (a) 


ed 
degree of superscalar design ? processors ? What fac 


Explain Tomasulo’s algorithm for d i 


Consider the fi (See Unit-I, p uling 19 

e lve stage pipeli “Il, Page 122,02 

reservation table — Pipelined processor specified by the la. 
10 


(i) 
(ii) 


cal the set of forbidden latencies and collision vector. 

raw a State transiti i i il 

ca a ansition diagram showing all possile initial 

a vithout causing collision in the pipeline. 

(ii) List all the simple cycles. 

(iv) Identify the greedy cycles. 

(v) What is MAL? 

(vi) What will be the maximum throughput ? 

Bias APN te a, (See Unit-II, Page 104, Prob.7) 
plain Static Arithmetic pipelines. (See Unit-III, Page 128, Q.35) 10 

Unit-IV
7. (a) Explain message routing schemes in multicomputer network. 10
(See Unit-IV, Page 170, Q.20)
(b) Discuss and compare the merits and demerits of snoopy bus protocols and directory based protocols. (See Unit-IV, Page 170, Q.18) 10
8. (a) Explain message routing schemes in multicomputer network. 7
(See Unit-IV, Page 170, Q.20)
(b) Write ... channel in multicomputer networks. 8
(See Unit-IV, Page 177, Q.28)
(c) What is multiple-context processors ? (See Unit-IV, Page 208, Q.58) 5
Unit-V
9. (a) Explain message passing model briefly. (See Unit-V, Page 225, Q.5) 10
(b) ... the features of parallel languages. Name the software tools which support parallelism. (See Unit-V, Page 247, Q.28) 10
10. Write short notes on any four of the following — 20
(a) Object oriented model (See Unit-V, Page 230, Q.10)
(b) Parallel compilers (See Unit-V, Page 238, Q.16)
(c) Shared variable model (See Unit-V, Page 219, Q.2)
(d) Data parallel model (See Unit-V, Page 227, Q.8)
(e) Parallel algorithm for quicksort. (See Unit-V, Page 248, Q.32)


B.E. (Sixth Semester) EXAMINATION, June, 2011 
(Computer Science & Engg. Branch) 
ADVANCE COMPUTER ARCHITECTURE 
[CS—605(N)] 


Attempt one question from each Unit. All questions carry equal marks. 
Unit-I 


1. (a) Explain Flynn’s classification based on multiplicity of instruction stream and data stream. (See Unit-I, Page 3, Q.1) 10
(b) Explain the following terms to measure performance of computer system — 10

(i) Clock rate and CPI (Cycle Per Instruction) 

(ii) MIPS (Million Instruction Per Second) rate 

(iii) Throughput rate 

(iv) Performance factor. 

(See Unit-I, Page 7, Q.4) 


2. (a) Explain the architectural operations of SIMD and MIMD computers. Distinguish between multiprocessor and multicomputers based on their structure. (See Unit-I, Page 15, Q.12) 10
(b) What is Interconnection Network ? Explain different interconnection network architectures comparing their architectural features. 10
(See Unit-I, Page 52, ...)
Unit-II


3. (a) Explain the following terms associated with cache design — 10
(i) Write through versus write back cache
(See Unit-IV, Page 153, Q.4)
(ii) Factors affecting cache hit ratios.
(See Unit-IV, Page 155, ...)
(b) Discuss arbitration, transaction and interrupt w.r. to backplane bus system.
(See Unit-II, Page 84, Q.38)
4. (a) Explain Interleaved memory organization. Justify the ... of interleaved memory organization. (See Unit-II, Page 71, Q.28) 10
(b) Explain MESI protocol for cache with suitable example.
(See Unit-IV, Page 156, ...)


Unit-III
5. (a) Explain the following approaches to the branch problem in pipeline — 10
(i) Branch elimination
(ii) Branch prediction
(iii) Branch target.
(See Unit-III, Page 116, Q.19)
(b) Distinguish between the following — 10
(i) Arithmetic and instruction pipeline
(ii) Unifunctional and multifunctional pipeline
(iii) Static and dynamic pipeline
(iv) Scalar and vector pipeline.
(See Unit-III, Page 132, Q.37)
6. (a) Explain the working of arithmetic pipeline with suitable example. 10
(See Unit-III, Page 128, Q.35)
(b) Consider the following reservation table for 4 stage pipeline with clock cycle P = 20 ns — 10
(i) What are the forbidden latencies and initial collision vector ?
(ii) Draw state transition diagram.
(iii) Determine the MAL associated with the shortest greedy cycle.
(iv) Determine the pipeline throughput corresponding to the MAL and given P.
(See Unit-III, Page 105, Prob.8)
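The reservation table for this problem did not survive in this copy, so the first two steps of the question can only be illustrated on a stand-in. The Python sketch below assumes a hypothetical three-stage table (`RESERVATION_TABLE` and every function name are illustrative, not taken from the question) and derives the forbidden latencies, the initial collision vector, and the successor states used when drawing a state transition diagram.

```python
# Hypothetical reservation table: stage -> clock cycles (1-based) in which
# one initiation occupies that stage. These marks are assumed for illustration.
RESERVATION_TABLE = {
    "S1": [1, 6, 8],
    "S2": [2, 4],
    "S3": [3, 5, 7],
}


def forbidden_latencies(table):
    """A latency is forbidden if two marks in the same row are that far apart."""
    forbidden = set()
    for cycles in table.values():
        for i, first in enumerate(cycles):
            for second in cycles[i + 1:]:
                forbidden.add(abs(second - first))
    return forbidden


def collision_vector(forbidden):
    """Pack the vector into an int; bit (k - 1) is 1 when latency k is forbidden."""
    return sum(1 << (k - 1) for k in forbidden)


def as_bits(vector, width):
    """Render the vector as C_m ... C_1 (largest latency on the left)."""
    return format(vector, f"0{width}b")


def next_state(state, latency, initial):
    """Shift the current state right by the latency, then OR with the initial vector."""
    return (state >> latency) | initial


forbidden = forbidden_latencies(RESERVATION_TABLE)
width = max(forbidden)
initial = collision_vector(forbidden)

print("forbidden latencies:", sorted(forbidden))
print("initial collision vector:", as_bits(initial, width))
for latency in range(1, width + 1):
    if latency not in forbidden:
        successor = next_state(initial, latency, initial)
        print("latency", latency, "->", as_bits(successor, width))
```

Once the MAL has been read off the greedy cycles of the resulting diagram, the throughput asked for in the last part is 1 / (MAL × P).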


Unit-IV
7. (a) Compare distributed and shared memory model. 10
(See Unit-IV, Page 196, Q.47)
(b) What is vector processing ? Explain the characteristics of vector processing. (See Unit-IV, Page 183, Q.33) 10
8. (a) Explain message routing schemes in multicomputer network. 10
(See Unit-IV, Page 170, Q.20)
(b) Explain Snoopy coherence protocol. (See Unit-IV, Page 163, Q.14) 10
Unit-V
9. (a) Explain various parallel programming models. 10
(See Unit-V, Page 234, Q.12)
(b) Write in brief on parallel languages and explain the features of parallel language for parallelism. (See Unit-V, Page 242, Q.21) 10
10. Write short notes on the following — 20
(i) Parallel software tools (See Unit-V, Page 246, Q.27)
(ii) Object oriented parallel programming model (See Unit-V, Page 230, Q.10)
(iii) Parallel programming environment (See Unit-V, Page 248, Q.31)
(iv) Vector access memory schemes. (See Unit-IV, Page 187, Q.40)


B.E. (Sixth Semester) EXAMINATION, June, 2012 
(Computer Science & Engg. Branch) 
ADVANCE COMPUTER ARCHITECTURE
(CS-605) 


Attempt one question from each Unit. All questions carry equal 
marks. 


Unit-I 

1. (a) Explain different computer models briefly. (See Unit-I, Page 3, Q.1) 10
(b) Define the following — 10
(i) Multiprocessors
(ii) Multicomputers
(iii) Multivector.
(See Unit-I, Page 18, Q.15)
Or
2. (a) Compare static interconnection networks and dynamic interconnection networks. (See Unit-I, Page 52, Q.47) 10


(b) Explain Hardware and Software parallelism. (See Unit-I, Page 22, Q.18)
Unit-II
3. (a) Draw VLIW architecture. (See Unit-II, ..., Q.12) 10
(b) Describe Interleaved memory organization. (See Unit-II, Page 71, Q.28) 10
4. (a) Explain Addressing and Timing protocols. (See Unit-II, Page 78, Q.35) 10
(b) Describe Backplane Bus system. (See Unit-II, ...)
Unit-III
5. (a) Discuss different pipeline designs for processors. 10
(See Unit-III, Page 132, Q.36)
(b) What is Branch handling ? Also write its advantages. 10
(See Unit-III, Page 123, Q.29)
Or
6. (a) Write Tomasulo’s algorithm. (See Unit-III, Page 122, Q.28)
(b) Draw superscalar pipeline design. (See Unit-III, Page 136, ...)
Unit-IV


7. (a) What is snoopy protocols ? Where is it used ? 10
(See Unit-IV, Page 163, Q.14)
(b) Discuss message routing schemes for multicomputer network.
(See Unit-IV, Page 171, Q.20)
Or
8. (a) Write vector processing principles. (See Unit-IV, Page 181, Q.32) 10
(b) Define virtual channel. Explain what is its need. 10 
(See Unit-IV, Page 177, Q.27) 

Unit-V 
9. (a) Describe about shared-variable model. 10 


(See Unit-V, Page 219, Q.2) 
(b) Write name of software tools and environments for parallelism. 10
(See Unit-V, Page 248, Q.30) 


Or 
10. (a) Draw message-passing model. (See Unit-V, Page 225, Q.5) 10 
(b) Write languages name for parallel programming. 10


(See Unit-V, Page 238, Q.8) 


B.E. (Sixth Semester) EXAMINATION, June 2013
ADVANCE COMPUTER ARCHITECTURE
(CS-605)
Note : Attempt one question from each unit. All questions carry equal marks.
Unit-I
1. (a) Explain how instruction set, compiler technology, CPU implementation and control, and cache and memory hierarchy affect the CPU performance and justify the effects in terms of program length, clock rate and effective CPI.
(See Unit-I, Page 10, Q.7)
(b) Compare data flow and control flow computers.
(See Unit-I, Page 30, Q.28)


2. (a) A 400 MHz processor is executing an object code with 2 × 10^6 instructions. The program consists of four major types of instructions. The instruction mix and the number of cycles (CPI) needed for each instruction type are given below —
Instruction Type
Arithmetic and logic
Load/Store with cache hit
Branch
Memory reference with cache miss
(i) Calculate the average CPI when the program is executed on a processor.
(ii) Calculate the MIPS rate.
(See Unit-I, Page 19, Prob.2)
(b) Distinguish between medium grain and fine grain multicomputer in 
their architectures and programming requirements. 
(See Unit-I, Page 28, Q.23) 
Unit-II 
3. (a) Explain the difference between superscalar and VLIW architectures 
in terms of hardware and software requirements. 
(See Unit-II, Page 61, Q.14) 
(b) Explain memory capacity planning briefly. 
(See Unit-II, Page 70, Q.26)


4. (a) Describe interleaved memory organization briefly.
(See Unit-II, Page 71, ...)
(b) Distinguish between scalar RISC and superscalar RISC in terms of instruction issue, pipeline architecture and processor performance.
(See Unit-II, Page 59, Q.11)
Unit-III
5. (a) Consider the execution of a program of 15,000 instructions by a linear pipeline processor with a clock rate of 25 MHz. Assume that the instruction pipeline has five stages.
(i) Calculate the speedup factor in using this pipeline to execute the program as compared with the use of an equivalent non-pipelined processor with an equal amount of flow-through delay.
(ii) What are the efficiency and throughput of this pipelined processor ?
(See Unit-III, ..., Prob. ...)
(b) Explain multifunctional arithmetic pipelines.
(See Unit-III, Page 132, Q.38)
6. Consider the following reservation table for a four stage pipeline with a clock cycle T = 20 ns —
(a) What are the forbidden latencies and initial collision vector ?
...
(See Unit-III, Page 105, Prob.8)
Unit-IV
7. (a) Explain the message routing schemes in multicomputer network.
(See Unit-IV, Page 170, Q.20)
(b) Describe the vector supercomputer architecture with block diagram.
(See Unit-I, Page 17, Q.14)
8. (a) Explain cache coherence problem and its solutions briefly.
(See Unit-IV, Page 163, Q.13)
(b) Discuss the principles of multithreading. (See Unit-IV, Page 204, ...)
9. (a) Explain shared variable and message passing model briefly.
(See Unit-V, Page 226, Q.6)
(b) Discuss the features of parallel language. (See Unit-V, Page 240, Q.20)
10. Write short notes on the following —
(a) Software tools (See Unit-V, Page 248, Q.30)
(b) Parallel compilers (See Unit-V, Page 238, Q.19)
(c) Object oriented model. (See Unit-V, Page 230, Q.10)

B.E. (Sixth Semester) EXAMINATION, June 2014
ADVANCE COMPUTER ARCHITECTURE
(CS-605)
Note : Attempt one question from each unit. All questions carry equal marks. Assume data/value if required.
Unit-I
1. (a) Compare control-flow, data-flow, and reduction computers in terms of the program flow mechanism.
(See Unit-I, Page 31, Q.31) 7
(b) Explain the following -
(i) Computational granularity
(ii) Communication latency.
(See Unit-I, Page 29, Q.26)
Or
2. (a) Comment on the advantages and disadvantages in control complexity, potential for parallelism and cost effectiveness of the above computer ...
(See Unit-I, Page 32, Q.32) 7
(b) Write short note on multistage and combining networks. 7
(See Unit-I, Page 49, Q.44)
Unit-II
3. (a) Distinguish between scalar RISC and superscalar RISC in terms of instruction issue, pipeline architecture and processor performance.
(See Unit-II, Page 59, Q.11) 7
(b) Explain the temporal locality, spatial locality and sequential locality associated with program/data access in a memory hierarchy. 7
(See Unit-II, Page 69, Q.23)


Or


4. (a) Explain about addressing and timing protocol.
(See Unit-II, Page 78, Q.35) 7
(b) What do you understand by coherence ? Explain briefly.
(See Unit-..., Page ..., Q....)
Unit-III
5. Consider the five stage pipelined processor specified by the following reservation table — 14
(a) List the set of forbidden latencies and collision vector.
(b) What is the minimum average latency (MAL) of this pipeline ?
(c) Draw a state transition diagram.
(See Unit-III, Page 104, Prob.7)
Or
6. (a) Explain possible data hazards with its resolving techniques. 7
(See Unit-III, Page 119, Q.25)
(b) Discuss the difference between Tomasulo’s approach and scoreboard techniques of dynamic scheduling.
(See Unit-III, Page 120, Q.27)

Unit-IV
7. (a) Describe about vector supercomputer architecture. 7
(See Unit-I, Page 17, Q.14)
(b) Explain about distributed memory model.
(See Unit-IV, Page 196, Q.45) 7
Or
8. (a) What is the use of snoopy protocol ? Explain. 7
(See Unit-IV, Page 163, Q.14)
(b) Write principles of multithreading. Also write multithreading issues.
(See Unit-IV, Page 207, Q.56) 7
Unit-V
9. (a) Discuss about parallel languages. Also write its features. 7
(See Unit-V, Page ...)
(b) Explain object oriented model.
(See Unit-V, Page 230, Q.10) 7
Or
10. (a) Write short note on message passing programming model. 7
(See Unit-V, Page 225, Q.5)
(b) ... of parallel programming environment. 7
(See Unit-V, Page 247, Q.29)


ADVANCE COMPUTER ARCHITECTURE
(CS-605)
Note : (i) Attempt one question from each unit. Each unit has equal marks.
(ii) Assume data/value if required.
Unit-I
1. (a) Describe at least four characteristics of MIMD multiprocessors that distinguish them from multiple computer systems. 7
(See Unit-I, Page 6, Q.2)
(b) Distinguish between static and dynamic connection networks. 7
(See Unit-I, Page 52, Q.47)
Or
2. (a) Distinguish between omega and crossbar networks.
(See Unit-I, Page 52, Q.46)
(b) Write short note on multistage connection networks. 7
(See Unit-I, Page 49, Q.44)
Unit-II
3. (a) Explain the difference between superscalar and VLIW architectures in terms of hardware and software requirements. 7
(See Unit-II, Page 61, Q.14)
(b) Compare the instruction set architecture in RISC and CISC processors in terms of instruction formats, addressing modes, and cycles per instruction. (See Unit-II, Page 56, Q.4) 7
Or
4. (a) Explain about arbitration, transaction and interrupt. 7
(See Unit-II, Page 84, Q.38)
(b) Explain the following terms associated with cache design — 7
(i) Write through versus write back caches
(See Unit-IV, Page 153, Q.4)
(ii) Factors affecting cache hit ratios
(See Unit-IV, Page 155, ...)
Unit-III
5. Consider the following pipeline reservation table —
(a) What are the forbidden latencies ?
(b) Draw the state transition diagram.
(c) Determine the optimal constant latency cycle and the minimal average latency.
(d) Let the pipeline clock period be T = 20 ns. Determine the throughput of this pipeline. (See Unit-III, Page ...)
Or
6. (a) Write Tomasulo’s algorithm. (See Unit-III, Page 122, Q.28) 7
(b) Describe about branch handling techniques. 7
(See Unit-III, Page 123, Q.29)
Unit-IV
7. (a) Explain the following terms related to vector processing — 7
(i) Vector and scalar balance point (See Unit-IV, Page 185, Q.35)
(ii) Vectorization compiler. (See Unit-IV, Page 194, Q.43)
(b) Explain about directory based protocols. 7
(See Unit-IV, Page 166, Q.17)
Or
8. (a) Describe message routing schemes in multicomputer network. 7
(See Unit-IV, Page 171, Q.20)
(b) Discuss the design space for granularity and connectivity of SIMD systems. (See Unit-IV, Page 203, Q.52) 7
Unit-V 
9. (a) Describe about function and logic models. 7
(See Unit-V, Page 233, Q.11)
(b) Discuss features of parallelism. (See Unit-V, Page 240, Q.20)
Or
10. (a) Explain about data-parallel model. (See Unit-V, Page 227, Q.8) 7
(b) Explain about parallel programming environment and tools.
(See Unit-V, Page 248, Q.31)


B.E. (Sixth Semester) EXAMINATION, June 2015
ADVANCE COMPUTER ARCHITECTURE
Note : (i) Answer five questions. In each question part A, B, C is compulsory and D part has internal choice.
(ii) All parts of each questions are to be attempted at one place.
(iii) All questions carry equal marks, out of which part A and B (Max. 50 words) carry 2 marks, part C (Max. 100 words) carry 3 marks, part D (Max. 400 words) carry 7 marks.
(iv) Except numericals, Derivation, Design and Drawing etc.
Unit-I
1. (a) What is instruction level parallelism ? (See Unit-I, Page 25, Q.20)
(b) What is the use of branch target buffer ? (See Unit-III, Page 124, Q.30)
(c) What is grain packing, coarse grain and fine grain ?
(See Unit-I, ..., Q.22)
(d) Explain three parallel architecture models and compare their merits and demerits.
Or
Explain the static and dynamic interconnection networks.
Unit-II
2. (a) What is memory interleaving ? (See Unit-II, Page ...)
(b) What are the limitations of VLIW ? (See Unit-II, Page 61, Q....)
(c) Explain locality of reference and memory hierarchy.
(See Unit-II, Page 70, Q.24)
(d) What is RISC attributes and discuss the advantages of RISC in comparison with other architecture. (See Unit-II, Page 58, Q.7)
Or
Explain addressing and timing protocols briefly.
(See Unit-II, Page 78, Q.35)

Unit-III 


(a) What is forbidden latency ? (See Unit-III, Page 93, Q.9) 
(b) Differentiate between linear pipeline processor and non-linear pipeline 
processor. 


(See Unit-III, Page 91, Q.7)
(c) Explain branch handling techniques. 


(See Unit-III, Page 123, Q.29) 


(d) ... reservation table ...
(iii) State transition diagram
(iv) MAL.
(See Unit-III, Page 106, Prob.9)
Or
Explain how to overcome data hazards with dynamic scheduling using Tomasulo’s approach. (See Unit-III, Page 123, ...)
Unit-IV
4. (a) What is multithreading ? (See Unit-IV, Page 206, Q.54)
(b) What is shared memory model ? (See Unit-IV, Page 196, ...)
(c) Explain vector memory access schemes. (See Unit-IV, Page 187, ...)
(d) What is meant by cache coherence problems ? Describe ... protocols for cache ...
(See Unit-IV, Page 170, Q.19)
Or
Describe the vector supercomputer architecture with neat diagram.
(See Unit-I, Page 17, Q.14)
Unit-V
5. (a) What are CRCW and CREW ? (See Unit-V, Page ...)
(b) Give an example ... languages. (See Unit-V, Page 238, Q.18)
(c) What is functional and logic models ? (See Unit-V, Page 233, Q.11)
(d) Discuss the advantages of various ...
Or
Explain shared variable and message passing model.
(See Unit-V, Page 226, Q.6)


B.E. (Sixth Semester) EXAMINATION, Dec. 2015
ADVANCE COMPUTER ARCHITECTURE
(CS-605)
Note : (i) Answer five questions. In each question part A, B, C and D is compulsory.
(ii) All parts of each questions are to be attempted at one place.
(iii) All questions carry equal marks, out of which part A and B (Max. 50 words) carry 2 marks, part C (Max. 100 words) carry 3 marks, part D (Max. 400 words) carry 7 marks.
(iv) Except numericals, Derivation, Design and Drawing etc.
Unit-I
1. (a) What is data flow graph ? (See Unit-I, Page 30, ...)
(b) What is grain size and latency ? (See Unit-I, Page 28, Q.24)
(c) Differentiate between software and hardware parallelism.
(See Unit-I, Page 23, Q.18)
(d) A 40 MHz processor was used to execute bench mark program with following instruction mix and clock cycle counts.
Integer arithmetic 45000
Data transfer 32000
Floating point 15000
Control transfer 8000
Determine the effective CPI, MIPS rate and execution time.
(See Unit-I, Page 20, Prob.3)
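The three quantities asked for in the 40 MHz benchmark question follow from the instruction counts and the cycles each type needs. The clock-cycle column did not survive in this copy, so the Python sketch below uses assumed cycle counts (1 for integer arithmetic, 2 for the other three types); these values, the first row label, the name `MIX` and the function name are illustrative assumptions and should be replaced by the figures in the original table.

```python
# Instruction mix: type -> (instruction count, cycles per instruction).
# Counts come from the question; the cycle counts are ASSUMED for illustration.
MIX = {
    "integer arithmetic": (45000, 1),
    "data transfer": (32000, 2),
    "floating point": (15000, 2),
    "control transfer": (8000, 2),
}

CLOCK_HZ = 40e6  # 40 MHz clock, as stated in the question


def effective_cpi_mips_time(mix, clock_hz):
    """Return (effective CPI, MIPS rate, execution time in seconds)."""
    total_instructions = sum(count for count, _ in mix.values())
    total_cycles = sum(count * cycles for count, cycles in mix.values())
    cpi = total_cycles / total_instructions   # weighted average cycles per instruction
    mips = clock_hz / (cpi * 1e6)             # MIPS = f / (CPI x 10^6)
    exec_time = total_cycles / clock_hz       # T = Ic x CPI / f
    return cpi, mips, exec_time


cpi, mips, exec_time = effective_cpi_mips_time(MIX, CLOCK_HZ)
print(f"effective CPI  = {cpi:.2f}")
print(f"MIPS rate      = {mips:.2f}")
print(f"execution time = {exec_time * 1e3:.3f} ms")
```

Under the assumed cycle counts this gives an effective CPI of 1.55, about 25.8 MIPS and an execution time of 3.875 ms; different cycle counts change all three figures.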
Or 


Explain multiprocessor and multicomputer briefly.
(See Unit-I, Page 12, Q.10)
Unit-II 

2. (a) Define the term hit ratio and miss ratio.
(See Unit-II, Page 71, Q.27)
(b) What is fault tolerance ? (See Unit-II, Page 75, Q.33)
(c) Explain the memory interleaving technique.
(See Unit-II, Page 71, Q.28)
(d) Discuss and compare the characteristics of CISC and RISC architectures.
(See Unit-II, Page 58, Q.6)
Or
Explain Backplane bus system briefly. (See Unit-II, Page 76, Q.34)
Unit-III
3. (a) What are the major hurdles of pipelining ?
(See Unit-III, Page 116, Q.20)
(b) What is the use of branch target buffer ?
(See Unit-III, Page 124, Q.30)
(c) How super scalar processing is fast compared to parallel processor ?
(d) Explain internal data forwarding and possible hazard between read and write operations in the context of instruction pipelining.
(See Unit-III, Page 118, Q.21)


Or
... superpipeline design. (See Unit-..., ...)
Unit-IV
4. (a) What is cache coherence ? (See Unit-IV, Page 149, Q.1)
(b) What is deadlock ? (See Unit-IV, Page 176, Q.26)
(c) Explain vector access memory schemes.
(See Unit-IV, Page 187, Q.40)
(d) Explain the concept of bus snooping by cache coherence.
(See Unit-IV, Page 163, Q.14)
Or
Describe the message routing schemes in multi computer network.
(See Unit-IV, Page 171, ...)
Unit-V
5. (a) What is object oriented model ? (See Unit-V, Page 230, Q.10)
(b) What is parallel compiler ? (See Unit-V, Page ...)
(c) Discuss about the language features for parallelism.
(See Unit-V, Page 240, Q.20)
(d) Explain message passing model briefly.
(See Unit-V, Page 225, Q.5)
Or
Discuss about the software tools and environment.
(See Unit-V, Page 248, Q.30)



B.E. (Sixth Semester) EXAMINATION, June 2016 
ADVANCE COMPUTER ARCHITECTURE
(CS-605)
Note : (i) Answer five questions. In each question part A, B, C is compulsory and D part has internal choice.
(ii) All parts of each questions are to be attempted at one place.
(iii) All questions carry equal marks, out of which part A and B (Max. 50 words) carry 2 marks, part C (Max. 100 words) carry 3 marks, part D (Max. 400 words) carry 7 marks.
(iv) Except numericals, Derivation, Design and Drawing etc.
Unit-I
1. (a) Briefly describe hardware and software parallelism.
(See Unit-I, Page 23, Q.18)


(b) Define latency and throughput of pipeline. (See Unit-III, Page 94, Q.11)
(c) What is the need of higher performance computers ?
(See Unit-I, Page 7, Q.3)
(d) Explain Flynn’s classification based on multiplicity of instruction streams and data streams.
(See Unit-I, Page 3, Q.1)
Or

Distinguish between multiprocessors and multicomputers based on 
their structure, resource sharing and interprocessor communication. 
(See Unit-I, Page 12, Q.10) 
Unit-II 
2. (a) How many types of vector instruction are there ?
(See Unit-IV, Page 187, Q.37)
(b) What is the importance of memory consistency model ?
(See Unit-IV, Page 202, Q.50)
(c) Define the terms : Access time, bandwidth.
(See Unit-II, Page 75, Q.32)
(d) Discuss and compare the characteristics of RISC and CISC architecture.
(See Unit-II, Page 58, Q.6)
Or
What is the basic concept of VLIW approach ?
(See Unit-II, Page 59, Q.12)


Unit-III
3. (a) What is pipeline CPI ? (See Unit-I, Page 9, Q.6)
(b) Explain multifunctional arithmetic pipelines.
(See Unit-III, Page 132, Q.38)
(c) Explain Tomasulo’s algorithm. (See Unit-III, Page 122, Q.28)
(d) Discuss different pipeline design for processor.
(See Unit-III, Page 132, Q.36)
Or
Explain about data and control hazards and internal forwarding and register tagging. (See Unit-III, Page 118, Q.22)
Unit-IV 
(a) Describe the we levels of threads, (See Unit-IV, Page 214, Q.62) 
(b) Discuss the directory based cache coherence protocol.
(See Unit-IV, Page 165, Q.16)
(c) Explain the models of memory consistency.
(See Unit-IV, Page 202, Q.51)


(d) List two approaches to cache coherence protocol.
(See Unit-IV, Page 170, Q.18)
Or
What are snoopy protocols ? When is it used ?
(See Unit-IV, Page 163, Q.14)
Unit-V
5. (a) Discuss the features of parallel language.
(See Unit-V, Page 240, Q.20)
(b) State and prove Amdahl’s law. (See Unit-V, Page 236, Q.16)
(c) Explain Array processing. (See Unit-V, Page 226, Q.7)
(d) Discuss about deterministic scheduling models for multiprocessor system.
(See Unit-IV, Page 214, Q.61)
Or
Explain the various pipeline vector processing methods.
(See Unit-IV, Page 184, Q.34)

B.E. (Sixth Semester) Examination, Dec. 2016
ADVANCE COMPUTER ARCHITECTURE
(CS-605)
Note : (i) Answer five questions. In each question part A, B, C is compulsory and D part has internal choice.
(ii) All parts of each questions are to be attempted at one place.
(iii) All questions carry equal marks, out of which part A and B (Max. 50 words) carry 2 marks, part C (Max. 100 words) carry 3 marks, part D (Max. 400 words) carry 7 marks.
(iv) Except numericals, Derivation, Design and Drawing etc.
1. (a) What do you mean by grain size and latency ?
(See Unit-I, Page 28, Q.24)
(b) Distinguish between static interconnection network and dynamic interconnection network.
(See Unit-I, Page 52, Q.47)
(c) Discuss Flynn’s Classification. (See Unit-I, Page 3, Q.1)
(d) Explain in detail the various performance metrics for communication ... and discuss their advantages and challenges of parallel ...
(See Unit-I, Page 29, Q.25)
Or 


Consider the execution of the following code segment consisting of five statements. Use Bernstein’s conditions to detect the maximum parallelism embedded in this code. Justify the portions that can be executed in parallel and the remaining portions that must be executed sequentially.
S1 : C = D × E
S2 : M = G + C
S3 : A = B + C
S4 : C = L + M
S5 : F = G ÷ E
(See Unit-I, Page 32, Prob.4)
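Bernstein’s conditions for the five-statement code segment can be checked mechanically. The Python sketch below is a minimal illustration, assuming the statements read as S1: C = D × E, S2: M = G + C, S3: A = B + C, S4: C = L + M, S5: F = G ÷ E (the operators are partly illegible in this copy); the `STATEMENTS` table and the function names are hypothetical. Two statements may run in parallel when neither writes what the other reads and they do not write the same variable.

```python
from itertools import combinations

# Each statement: (input set, output set). The readings of the operators are
# assumed; only the variable names matter for Bernstein's conditions.
STATEMENTS = {
    "S1": ({"D", "E"}, {"C"}),   # C = D x E
    "S2": ({"G", "C"}, {"M"}),   # M = G + C
    "S3": ({"B", "C"}, {"A"}),   # A = B + C
    "S4": ({"L", "M"}, {"C"}),   # C = L + M
    "S5": ({"G", "E"}, {"F"}),   # F = G / E
}


def bernstein_parallel(first, second):
    """True when I1 & O2, I2 & O1 and O1 & O2 are all empty."""
    (in_1, out_1), (in_2, out_2) = first, second
    return not (in_1 & out_2) and not (in_2 & out_1) and not (out_1 & out_2)


def parallel_pairs(statements):
    """List every pair of statements that satisfies Bernstein's conditions."""
    return [
        (p, q)
        for p, q in combinations(statements, 2)
        if bernstein_parallel(statements[p], statements[q])
    ]


pairs = parallel_pairs(STATEMENTS)
for p, q in pairs:
    print(f"{p} || {q}")
```

With these input and output sets the check reports S1 || S5, S2 || S3, S2 || S5, S3 || S5 and S4 || S5; every other pair conflicts on C or M. The relation is pairwise only, so a larger parallel group still has to be confirmed pair by pair.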


2. (a) Compare superscalar and VLIW processors.
(See Unit-II, Page 61, Q.15)
(b) Define how we can achieve the degree of parallelism ?
(See Unit-I, Page 25, Q.21)
(c) Distinguish between SRAM and DRAM. (See Unit-II, Page 63, Q.17)
(d) Explain various address translation mechanism in a virtual memory.
(See Unit-II, Page 67, Q.22)
Or
Explain how instruction set, memory hierarchy, compiler technology, CPU implementation and control affect the CPU performance and justify the effect in terms of program length, clock rate and effective CPI.
(See Unit-I, Page 10, Q.7)
3. (a) What is Dynamic instruction scheduling ?
(See Unit-III, Page 120, Q.27)
(b) Compare the performance of the superscalar processor and superpipelined processor.
(See Unit-III, Page 146, Q.47)
(c) Explain that the maximum speed-up of a pipeline is equal to its stages.
(See Unit-III, Page 89, Q.5)
(d) With non-linear processors, explain pipelining with latency analysis, make use of relevant state diagrams whenever required.
(See Unit-III, Page 91, Q.8)
Or
What do you understand by delayed branch approach of jump instruction in the instruction pipeline ? Discuss with suitable examples.
(See Unit-III, Page 127, Q.34)
4. (a) ... the cache performance issues. (See Unit-IV, Page 155, Q.8)
(b) ... inclusion property and memory coherence requirements in a multilevel memory hierarchy. (See Unit-II, Page 65, Q.20)

(c) ... shared virtual memories ... multicomputers ...
(d) Explain the following terminologies associated with SIMD ...
... routing function ...
... shuffle-exchange and omega networks. (See Unit-I, Page 53, ...)
Or
What is vector processing ? Give some example ... processing. Also discuss some primitive vector ...
(See Unit-IV, Page ...)
5. (a) Define the term loop skewing. (See Unit-V, ...)
(b) What are the features of control of parallelism ?
(See Unit-V, Page 242, ...)
(c) What is an ideal situation for user in programming ... ? Give a suitable example. (See Unit-V, ...)
(d) Write short notes on parallel programming models.
(See Unit-V, Page 234, ...)
Or
Explain the following loop transformations and discuss how to ... them for loop vectorization or parallelization —
(i) Loop permutation (ii) Loop reversal
(iii) Locality optimization (iv) Software pipelining
(See Unit-V, Page 243, ...)

B.E. (Sixth Semester) Examination, Ju...
ADVANCE COMPUTER ARCHITECTURE
(CS-605)
Note : (i) Attempt any five questions.
(ii) All questions carry equal marks.
What do you understand by the performance of the ... ? How it is ... measuring the program ... a pipeline ? What are ...
(See Unit-I, Page 9, Q.5)
... space in uniprocessor ... (See Unit-V, Page ...)
... in implementing memory hierarchy. Distinguish between write through and write ... policies ... memory coherence requirements in a ...
(See Unit-II, Page 65, Q.20)
Explain the following terms associated with cache and memory ...
(a) ... Interleaving ...
(b) ... address cache versus virtual address cache
(See Unit-II, Page 73, Q.29)
Explain the various pipeline hazards and methods to overcome.
(See Unit-III, Page 119, Q.25)
Explain how thread level parallelism within a processor can be ... Explain simultaneous multithreading with suitable ... challenges and performance enhancement ...
(See Unit-IV, Page 215, Q.63)
Answer any four of the following -
(a) Define the various parallel processing levels. (See Unit-I, Page 23, ...)
(b) Explain the difference between super scalar and VLIW architecture in terms of hardware and software requirements. (See Unit-II, Page 61, Q.14)
(c) Write a short note on multifunction arithmetic pipelines.
(See Unit-III, ...)
(d) Explain the term gather, scatter and masking instructions related to vector processing. (See Unit-IV, Page 186, Q.36)
(e) What are the language features for parallelism ?
(See Unit-V, Page 240, Q.20)
(f) What do you mean by tuple space model of parallel programming ? Write a Linda program for any task graph you assume.
(See Unit-V, Page 235, Q.14)


CS-605 (GS)
B.E. (VI Semester) EXAMINATION, Dec. 2017
Grading System (GS)
ADVANCE COMPUTER ARCHITECTURE
Note : (i) Attempt any five questions.
(ii) All questions carry equal marks.
1. (a) State the CPU performance equation and discuss the factors that affect performance. (See Unit-I, Page 11, Q.8) 7
(b) Explain Flynn’s classification of computer with the help of neat diagram. (See Unit-I, Page 3, Q.1) 7
2. (a) Explain data path and its control in detail. 7
(See Unit-III, Page 137, Q.41)
(b) What is Hazard ? Explain its types with suitable example.
(See Unit-III, Page ...)


3. (a) Explain the following -
(i) Crossbar switch (See Unit-I, Page 44, ...)
(ii) Multiport memory. (See Unit-I, Page 47, ...)
(b) Differentiate between CISC scalar processors and RISC scalar processors. (See Unit-II, ...)
4. (a) Discuss in detail about the performance issues in symmetric and distributed shared memory architectures.
(See Unit-IV, Page 197, ...) 7
(b) What are the different ways for branch prediction ? Discuss how pipeline performance issues can be reduced by branch prediction.
(See Unit-III, Page 126, ...)
5. (a) Briefly explain how to overcome data hazards with dynamic scheduling using Tomasulo’s approach.
(See Unit-III, Page 123, ...)
(b) Explain the design of super pipeline processor with diagram.
(See Unit-III, Page 146, Q.46)
6. (a) How is multithreading used to exploit thread level parallelism within a processor ? Explain with example. (See Unit-IV, Page 217, ...)
(b) Differentiate between distributed memory model and shared memory model. (See Unit-IV, Page 196, Q.47) 7
7. (a) Describe the language features for parallelism. 7
(See Unit-V, Page 240, Q.20)
(b) Explain shared variable model and message passing model in detail. 7
(See Unit-V, Page 226, Q.6)
8. Write short note (any four) -
(a) Functional and logic model (See Unit-V, Page 233, Q.11)
(b) Multiple context processors (See Unit-IV, Page 208, Q.58)
(c) Deadlock (See Unit-IV, Page 176, Q.26)
(d) Cache coherence (See Unit-IV, Page 149, Q.1)
(e) Memory interleaving. (See Unit-II, Page 71, Q.28)


B.E. (VI Semester)
Choice Based Grading System

Note : (i) Answer any five questions, each question carries equal marks.
(ii) Assume suitable data if missing.

1. (a) Analyze the data dependencies among the following statements - 10
S1 : Load R1, 1024 /R1 <- 1024/
S2 : Load R2, M (10) /R2 <- Memory (10)/
S3 : Add R1, R2 /R1 <- (R1) + (R2)/
S4 : Store M (1024), R1 /Memory (1024) <- (R1)/
S5 : Store M ((R2)), 1024 /Memory (64) <- 1024/

Note that (Ri) means the content of register Ri and memory
(10) contains 64 initially.

(i) Draw a dependence graph to show all the dependencies.
(ii) Are there any resource dependencies if only one copy of each
functional unit is available in CPU? (See Unit-I, Page 34, Prob.5)

(b) Compare and comment on static and dynamic interconnection
network in terms of node degree, network diameter and bisection
width. (See Unit-I, Page 47, Q.42) 4

2. Consider the five stage pipelined processor specified by the following
reservation table -

(a) List the set of forbidden latencies and collision vector.
(b) Draw a state transition diagram showing all possible initial sequences
without causing a collision in the pipeline.
(c) List all the simple cycles
(d) Identify the greedy cycles



(e) What is the MAL of the pipeline?
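
The reservation table for question 2 is not reproduced in this paper, so the sketch below uses a hypothetical three-stage, eight-column table only to illustrate how part (a) can be worked. It assumes the usual definitions: a latency is forbidden if two marks in the same row are that many columns apart, and the collision vector C = (Cm ... C1) has Ci = 1 when latency i is forbidden, with m the largest forbidden latency. The table, the function names and the bit ordering are assumptions and not part of the original question.

```python
from itertools import combinations

# Hypothetical reservation table: one row per stage, one column per clock
# cycle, 1 where the stage is in use. This is NOT the table from the paper.
TABLE = [
    [1, 0, 0, 0, 0, 1, 0, 1],  # stage S1
    [0, 1, 0, 1, 0, 0, 0, 0],  # stage S2
    [0, 0, 1, 0, 1, 0, 1, 0],  # stage S3
]


def forbidden_latencies(table):
    """Return the set of column distances between two marks in any one row."""
    forbidden = set()
    for row in table:
        marked = [t for t, used in enumerate(row) if used]
        # marked is ascending, so every pair (a, b) has a < b.
        for a, b in combinations(marked, 2):
            forbidden.add(b - a)
    return forbidden


def collision_vector(table):
    """Return C = (Cm ... C1) as a bit string; Ci is 1 if latency i is forbidden."""
    forbidden = forbidden_latencies(table)
    if not forbidden:
        return ""
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


if __name__ == "__main__":
    print(sorted(forbidden_latencies(TABLE)))  # [2, 4, 5, 7]
    print(collision_vector(TABLE))             # 1011010
```

The collision vector is the initial state of the state transition diagram asked for in part (b), and the permissible latencies are those whose bit is 0.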

3. (a) With respect to the … explain internal data forwarding and …

(b) … of non-pipelined serial …

4. (a) Define vector processing and its instruction types. Also explain Gather,
Scatter and Masking instructions in Cray microprocessor. 7
(See Unit-IV, Page 187, Q.38)

(b) What is cache coherence protocol ? Explain Goodman’s write
once cache coherence protocol. (See Unit-IV, Page 59)
5. (a) Write and explain four operational modes used in a
multiprocessor system by giving an example of each.
(See Unit-V, Page 224, Q.3)
(b) Explain store-and-forward routing, wormhole routing and its
handshaking protocol associated with message-passing mechanism. 7
(See Unit-IV, Page 174, Q.24)

6. (a) Draw and explain block diagram of backplane bus system. Also,
describe bus arbitration and control. (See Unit-II, Page 76, Q.34) 7

(b) Explain the temporal locality, spatial locality and sequential locality
associated with program/data access in a memory hierarchy. 7
(See Unit-II, Page 69, Q.23)

7. (a) Distinguish between multiprocessors and multi-computers based
on their structures, resource sharing and interprocessor
communication. (See Unit-I, Page 12, Q.10) 7

(b) What are inclusion property and memory coherence requirements ?
Distinguish between write through and write back policies. 7
(See Unit-II, Page 66, Q.20)
8. Write short notes on following (any three) - 14
(a) VLIW architecture (See Unit-II, Page 59, Q.12)
(b) SIMD super computer (See Unit-I, Page 18, Q.16)
(c) Snoopy bus protocol (See Unit-IV, Page 163, Q.14)
(d) Tomasulo’s algorithm. (See Unit-III, Page 123, Q.28)


